DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation

Suraj N Kothawade; Anmol Reddy Mekala; D.Chandra Sekhara SS Hetha Havya; Mayank Kothyari; Rishabh K Iyer; Ganesh Ramakrishnan; Preethi Jyothi

Main: Theme: Reality Check Main-poster Paper

Session 4: Theme: Reality Check (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 11, Session 4 (15:00-16:30 UTC)

Keywords: (non-)generalizability, evaluation, methodology

Abstract: State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that are most representative of the target accents therefore becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn), which uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e., it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that compared to other speech selection methods, DITTO is 3-5 times as label-efficient for its improvements on the Indic-TTS and L2 datasets.
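The budgeted, targeted selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which SMI functions or utterance features DITTO uses, so the sketch assumes a facility-location-style SMI objective, cosine similarity over toy feature vectors, and plain greedy maximization. The names `smi_value`, `greedy_select`, `cosine_similarity`, and the parameter `eta` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def smi_value(selected, sim, eta=1.0):
    """Facility-location-style SMI score (an assumed choice of objective).

    sim[i, q] is the similarity between pool utterance i and query
    (target-accent) utterance q. The first term rewards covering every
    query item; the second rewards each selected item for being close
    to some query item.
    """
    if not selected:
        return 0.0
    sub = sim[selected]                      # shape (|selected|, |query|)
    query_coverage = sub.max(axis=0).sum()   # best match per query item
    relevance = sub.max(axis=1).sum()        # best match per selected item
    return float(query_coverage + eta * relevance)


def greedy_select(sim, budget, eta=1.0):
    """Greedily add the pool item with the largest marginal gain."""
    selected = []
    remaining = set(range(sim.shape[0]))
    for _ in range(min(budget, sim.shape[0])):
        base = smi_value(selected, sim, eta)
        best = max(
            sorted(remaining),
            key=lambda i: smi_value(selected + [i], sim, eta) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): rows are feature vectors of unlabeled utterances.
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
                 [0.1, 0.9], [0.95, 0.05], [-1.0, 0.0]])
# A few examples of the target accent act as the query set.
query = np.array([[1.0, 0.0], [0.98, 0.02]])

chosen = greedy_select(cosine_similarity(pool, query), budget=3)
```

Under these assumptions, targeting several accents amounts to placing examples from each accent in `query`: the coverage term sums over all query items, so a selection that ignores one accent scores lower than one that matches every accent. Whether this matches the paper's notion of fair targeting in detail is not stated in the abstract.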