Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation

Yuchen Han, Chen Xu, Tong Xiao, Jingbo Zhu

Main: Speech and Multimodality Main-poster Paper

Session 7: Speech and Multimodality (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 12, Session 7 (15:00-16:30 UTC)
Keywords: spoken language translation
Abstract: Pre-training and fine-tuning is a paradigm for alleviating the data scarcity problem in end-to-end speech translation (E2E ST). The commonplace "modality gap" between speech and text data often leads to inconsistent inputs between pre-training and fine-tuning. However, we observe that this gap occurs in the early stages of fine-tuning and does not have a major impact on the final performance. On the other hand, we find that there is another gap, which we call the "capacity gap": high-resource tasks (such as ASR and MT) always require a large model to fit, and when the model is reused for a low-resource task (E2E ST), it yields sub-optimal performance due to over-fitting. In a case study, we find that regularization plays a more important role than the well-designed modality adaption method, achieving 29.0 BLEU for En-De and 40.3 BLEU for En-Fr on the MuST-C dataset.
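
To make the idea concrete, the sketch below illustrates (in generic PyTorch, not the authors' code) the kind of regularized fine-tuning the abstract alludes to: raising dropout, adding label smoothing, and using weight decay when adapting a large pre-trained model to the low-resource ST task. The function name, hyperparameter values, model interface, and pad index are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

def finetune_with_regularization(model: nn.Module,
                                 train_loader,
                                 dropout: float = 0.3,        # assumed: raised from the pre-training value
                                 label_smoothing: float = 0.1,
                                 weight_decay: float = 1e-4,
                                 lr: float = 1e-4,
                                 epochs: int = 10):
    # Increase dropout throughout the pre-trained network to curb over-fitting
    # on the small ST corpus (the "capacity gap" described in the abstract).
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = dropout

    # Label smoothing and weight decay act as additional regularizers.
    criterion = nn.CrossEntropyLoss(label_smoothing=label_smoothing,
                                    ignore_index=0)  # 0 assumed to be the pad id
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)

    model.train()
    for _ in range(epochs):
        for speech, target in train_loader:
            # model is assumed to return logits of shape (batch, tgt_len, vocab)
            logits = model(speech, target[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             target[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

This is only a schematic of standard regularization techniques during fine-tuning; the paper's specific setup and hyperparameters should be taken from the paper itself.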