A Simple Concatenation can Effectively Improve Speech Translation

Linlin Zhang, Kai Fan, Boxing Chen, Luo Si

Main: Machine Translation Main-poster Paper

Poster Session 2: Machine Translation (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 10, 14:00-15:30 (EDT) (America/Toronto)
Global Time: July 10, Poster Session 2 (18:00-19:30 UTC)
Keywords: speech translation
TLDR: A triple speech translation data comprises speech, transcription, and translation. In the end-to-end paradigm, text machine translation (MT) usually plays the role of a teacher model for the speech translation (ST) via knowledge distillation. Parameter sharing with the teacher is often adopted to c...
You can open the #paper-P3174 channel in a separate window.
Abstract: A triple speech translation data comprises speech, transcription, and translation. In the end-to-end paradigm, text machine translation (MT) usually plays the role of a teacher model for the speech translation (ST) via knowledge distillation. Parameter sharing with the teacher is often adopted to construct the ST model architecture, however, the two modalities are independently fed and trained via different losses. This situation does not match ST's properties across two modalities and also limits the upper bound of the performance. Inspired by the works of video Transformer, we propose a simple unified cross-modal ST method, which concatenates speech and text as the input, and builds a teacher that can utilize both cross-modal information simultaneously. Experimental results show that in our unified ST framework, models can effectively utilize the auxiliary information from speech and text, and achieve compelling results on MuST-C datasets.