An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language

Robert Jimerson; Zoey Liu; Emily Prud'hommeaux

Main: Linguistic Diversity Main-oral Paper

Session 3: Linguistic Diversity (Oral)

Conference Room: Pier 7&8

Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)

Global Time: July 11, Session 3 (13:00-14:30 UTC)

Keywords: less-resourced languages, endangered languages, indigenous languages

Abstract: Advances in deep neural models for automatic speech recognition (ASR) have yielded dramatic improvements in ASR quality for resource-rich languages, with English ASR now achieving word error rates comparable to those of human transcribers. The vast majority of the world's languages, however, lack the quantity of data necessary to approach this level of accuracy. In this paper we use four of the most popular ASR toolkits to train ASR models for fifteen languages with limited ASR training resources: eleven widely spoken languages of Africa, Asia, and South America, one endangered language of Central America, and three critically endangered languages of North America. We find that no single architecture consistently outperforms any other. The differences in performance across architectures do not so far appear to be related to any particular feature of the datasets or characteristic of the languages. These findings have important implications for future research in ASR for under-resourced languages. ASR systems for languages with abundant existing media and available speakers may derive the most benefit simply from the collection of large amounts of additional acoustic and textual training data. Communities using ASR to support endangered language documentation efforts, who cannot easily collect more data, might instead focus on exploring multiple architectures and hyperparameterizations to optimize performance within the constraints of their available data and resources.
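
For readers unfamiliar with the metric the abstract refers to, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring pipeline used in the paper. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length.

    Hypothetical helper for illustration only. It assumes whitespace
    tokenization and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. Comparing toolkits on the same test set would also require identical text normalization for every system, which this sketch leaves out.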