A Theory of Unsupervised Speech Recognition

Liming Wang, Mark Hasegawa-Johnson, Chang D. Yoo

Main: Speech and Multimodality Main-poster Paper

Poster Session 6: Speech and Multimodality (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 12, 09:00-10:30 (EDT) (America/Toronto)
Global Time: July 12, Poster Session 6 (13:00-14:30 UTC)
Keywords: automatic speech recognition
TLDR: Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from *unpaired* speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework is missing to study their properties ...
Abstract: Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from *unpaired* speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework is missing to study their properties and to address such issues as sensitivity to hyperparameters and training instability. In this paper, we propose a general theoretical framework to study the properties of ASR-U systems based on random matrix theory and the theory of neural tangent kernels. This framework allows us to prove various learnability conditions and sample complexity bounds for ASR-U. Extensive ASR-U experiments on synthetic languages with three classes of transition graphs provide strong empirical evidence for our theory (code available at https://github.com/cactuswiththoughts/UnsupASRTheory.git).
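To make the ASR-U setting concrete, the following is a minimal toy sketch (not the paper's algorithm): given unpaired corpora of discrete acoustic units and phoneme strings generated by the same underlying language, one can try to recover the unit-to-phoneme mapping purely by matching bigram statistics of the two corpora. All names here (`ngram_dist`, `best_mapping`, the tiny corpora) are hypothetical illustrations; real ASR-U systems use adversarial or neural objectives rather than brute-force search.

```python
import itertools
from collections import Counter

def ngram_dist(seqs, n=2):
    """Empirical n-gram distribution over a corpus of symbol sequences."""
    counts = Counter(tuple(s[i:i + n]) for s in seqs
                     for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def best_mapping(speech_units, text, units, phones):
    """Brute-force search over unit->phoneme bijections that minimizes
    the TV distance between bigram statistics of the mapped speech
    corpus and the text corpus (toy illustration only)."""
    text_dist = ngram_dist(text)
    best, best_d = None, float("inf")
    for perm in itertools.permutations(phones):
        m = dict(zip(units, perm))
        mapped = [[m[u] for u in s] for s in speech_units]
        d = tv_distance(ngram_dist(mapped), text_dist)
        if d < best_d:
            best, best_d = m, d
    return best, best_d

# Unpaired corpora from the same toy language; true mapping is
# 0 -> 'a', 1 -> 'b', 2 -> 'c' (never shown to the learner).
text = [["a", "a", "b", "c"], ["a", "b", "b", "c"]]
speech = [[0, 0, 1, 2], [0, 1, 1, 2]]
mapping, dist = best_mapping(speech, text, [0, 1, 2], ["a", "b", "c"])
```

On this asymmetric corpus the bigram statistics identify the mapping uniquely; with more symmetric transition graphs (e.g. cyclic ones) several mappings match equally well, which is one reason learnability conditions of the kind the paper studies are needed.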