Enhancing Video Translation Context with Object Labels

Jeremy Gwinnup, Tim Anderson, Brian Ore, Eric Hansen, Kevin Duh

The 20th International Conference on Spoken Language Translation, Long Paper

Abstract: We present a simple yet efficient method to enhance the quality of machine translation models trained on multimodal corpora by augmenting the training text with labels of detected objects in the corresponding video segments. We then test the effects of label augmentation in a baseline condition and in two automatic speech recognition (ASR) conditions. In contrast with multimodal techniques that merge visual and textual features, our modular method is easy to implement and its results are more interpretable. We compare Transformer translation architectures trained with baseline and augmented labels, showing improvements of up to +1.0 BLEU on the How2 dataset.
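
The label augmentation described in the abstract might look roughly like the following minimal sketch. The abstract does not specify the exact format, so everything here is an assumption made for illustration: the hypothetical function name `augment_source`, the `<sep>` separator token, appending labels after the sentence, lowercasing and de-duplicating labels, and the `max_labels` cap. The object detector itself is out of scope; its output is represented as a plain list of label strings.

```python
from typing import Iterable, List


def augment_source(
    sentence: str,
    labels: Iterable[str],
    sep: str = "<sep>",      # assumed separator token, not from the paper
    max_labels: int = 5,     # assumed cap on labels per segment
) -> str:
    """Append detected-object labels to a source sentence.

    Labels are lowercased, stripped, and de-duplicated in first-seen order.
    If no usable labels remain, the sentence is returned unchanged.
    """
    seen = set()
    unique: List[str] = []
    for label in labels:
        norm = label.strip().lower()
        if norm and norm not in seen:
            seen.add(norm)
            unique.append(norm)

    kept = unique[:max_labels]
    if not kept:
        return sentence
    return f"{sentence} {sep} {' '.join(kept)}"


# Hypothetical detector output for one video segment.
augmented = augment_source(
    "add the onions to the pan",
    ["Pan", "onion", "pan", "knife"],
)
print(augmented)  # add the onions to the pan <sep> pan onion knife
```

Because the augmentation only touches the text, the translation model and its training pipeline stay unchanged, which is what makes the approach modular: the labels a model saw can be read directly from its input.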