CKDST: Comprehensively and Effectively Distill Knowledge from Machine Translation to End-to-End Speech Translation
Yikun Lei, Zhengshan Xue, Xiaohu Zhao, Haoran Sun, Shaolin Zhu, Xiaodong Lin, Deyi Xiong
Findings Paper: Machine Translation
Session 7: Machine Translation (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 12, Session 7 (15:00-16:30 UTC)
Keywords:
speech translation
TLDR:
Distilling knowledge from a high-resource task, e.g., machine translation, is an effective way to alleviate the data scarcity problem of end-to-end speech translation.
However, previous works simply use classical knowledge distillation, which does not allow for adequate transfer of knowledge from machine translation.
Abstract:
Distilling knowledge from a high-resource task, e.g., machine translation, is an effective way to alleviate the data scarcity problem of end-to-end speech translation.
However, previous works simply use classical knowledge distillation, which does not allow for adequate transfer of knowledge from machine translation.
In this paper, we propose a comprehensive knowledge distillation framework for speech translation, CKDST, which is capable of comprehensively and effectively distilling knowledge from machine translation to speech translation from two perspectives: cross-modal contrastive representation distillation and simultaneous decoupled knowledge distillation.
In the former, we leverage a contrastive learning objective to optimize the mutual information between speech and text representations for representation distillation in the encoder.
In the latter, we decouple the non-target class knowledge from the target class knowledge for logits distillation in the decoder (illustrative sketches of both objectives follow the abstract).
Experiments on the MuST-C benchmark dataset demonstrate that CKDST substantially improves the baseline by an average of 1.2 BLEU across all translation directions and outperforms previous state-of-the-art end-to-end and cascaded speech translation models.
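As an illustration of the first component, below is a minimal sketch of a cross-modal contrastive objective of the kind the abstract describes: an InfoNCE-style loss over paired speech and text encoder representations, where paired rows are positives and other in-batch rows are negatives. The function name, the pooled-representation assumption, and the temperature value are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_repr: torch.Tensor,
                                 text_repr: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style cross-modal loss: the i-th speech and text
    representations form a positive pair; all other in-batch rows
    serve as negatives."""
    # Normalize so dot products become cosine similarities.
    s = F.normalize(speech_repr, dim=-1)   # (B, D) pooled speech encoder states
    t = F.normalize(text_repr, dim=-1)     # (B, D) pooled text encoder states
    logits = s @ t.T / temperature         # (B, B) pairwise similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetric cross-entropy: align speech-to-text and text-to-speech.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```

Note that with in-batch negatives, the batch size controls the number of contrastive negatives; the paper may construct positive and negative pairs differently.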
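Likewise, a hedged sketch of decoupled logits distillation in the decoder, following the general decoupled-KD idea of separating target-class from non-target-class knowledge. The function name, the alpha/beta weighting, and the temperature handling are assumptions; how CKDST applies this "simultaneously" alongside its other training objectives is specified in the paper itself.

```python
import torch
import torch.nn.functional as F

def decoupled_kd_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target: torch.Tensor,
                      alpha: float = 1.0,
                      beta: float = 1.0,
                      T: float = 1.0) -> torch.Tensor:
    """Split logits distillation into target-class KD (TCKD) and
    non-target-class KD (NCKD), weighted independently."""
    B, C = student_logits.shape
    idx = torch.arange(B, device=student_logits.device)
    p_s = F.softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # TCKD: binary (target class vs. everything else) distributions.
    pt_s, pt_t = p_s[idx, target], p_t[idx, target]
    b_s = torch.stack([pt_s, 1 - pt_s], dim=-1).clamp_min(1e-8)
    b_t = torch.stack([pt_t, 1 - pt_t], dim=-1).clamp_min(1e-8)
    tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean")
    # NCKD: distributions renormalized over the non-target classes only
    # (the target logit is masked out before the softmax).
    mask = F.one_hot(target, C).bool()
    nt_s = F.log_softmax(student_logits.masked_fill(mask, -1e9) / T, dim=-1)
    nt_t = F.softmax(teacher_logits.masked_fill(mask, -1e9) / T, dim=-1)
    nckd = F.kl_div(nt_s, nt_t, reduction="batchmean")
    # T**2 restores the gradient scale under temperature softening.
    return (alpha * tckd + beta * nckd) * (T * T)
```

Weighting the NCKD term independently (beta) is what lets the non-target "dark knowledge" contribute fully, rather than being suppressed by a confident target prediction as in classical KD.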