[SRW] Choosing What to Mask: More Informed Masking for Multimodal Machine Translation

Julia Sato; Helena Caseli; Lucia Specia

Student Research Workshop (SRW) Paper

Session 3: Student Research Workshop (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)

Global Time: July 11, Session 3 (13:00-14:30 UTC)

Abstract: Pre-trained language models have achieved remarkable results on several NLP tasks. Most of them adopt masked language modeling, learning representations by randomly masking tokens and predicting them from their context. However, selecting the tokens to mask at random is inefficient for learning some language patterns, because it may not take into account linguistic information that can be helpful for many NLP tasks, such as multimodal machine translation (MMT). Hence, we propose three novel masking strategies for cross-lingual visual pre-training (more informed visual masking, more informed textual masking, and more informed visual and textual masking), each one focusing on learning different linguistic patterns. We apply them to Vision Translation Language Modelling for video subtitles (Sato et al., 2022) and conduct extensive experiments on the Portuguese-English MMT task. The results show that our masking approaches yield significant improvements over the original random masking strategy in downstream MMT performance. Our models outperform the MMT baseline and achieve a state-of-the-art BLEU score of 52.70 on the How2 dataset, indicating that more informed masking helps in acquiring an understanding of specific language structures and has great potential for language understanding.
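
To make the contrast between random and more informed masking concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not say how tokens are selected, so the use of part-of-speech tags, the `preferred` tag set, the masking ratio, and the function names `random_mask` and `informed_mask` are all assumptions made for illustration only.

```python
import random

MASK = "[MASK]"


def _num_to_mask(length, ratio):
    # Mask at least one token when possible, never more than the sequence length.
    return min(length, max(1, round(length * ratio)))


def random_mask(tokens, ratio=0.15, rng=None):
    """Random masking: every position is equally likely to be chosen."""
    rng = rng or random.Random()
    n = _num_to_mask(len(tokens), ratio)
    chosen = set(rng.sample(range(len(tokens)), n))
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    return masked, chosen


def informed_mask(tokens, tags, preferred, ratio=0.15, rng=None):
    """Hypothetical informed masking: positions whose tag is in `preferred`
    are drawn first; other positions are used only if more are needed."""
    rng = rng or random.Random()
    n = _num_to_mask(len(tokens), ratio)
    preferred_pos = [i for i, tag in enumerate(tags) if tag in preferred]
    other_pos = [i for i, tag in enumerate(tags) if tag not in preferred]
    rng.shuffle(preferred_pos)
    rng.shuffle(other_pos)
    chosen = set((preferred_pos + other_pos)[:n])
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    return masked, chosen


if __name__ == "__main__":
    # Toy sentence with hand-written tags (assumed, not from the How2 dataset).
    tokens = ["she", "opens", "the", "red", "door", "slowly"]
    tags = ["PRON", "VERB", "DET", "ADJ", "NOUN", "ADV"]
    print(random_mask(tokens, ratio=0.34, rng=random.Random(0))[0])
    print(informed_mask(tokens, tags, {"NOUN", "ADJ"}, ratio=0.34)[0])
```

With this toy input, the informed variant always masks "red" and "door", because exactly two positions carry a preferred tag and the ratio asks for two masked tokens. The random variant may instead spend its budget on words such as "the". Which token classes are worth prioritizing in practice is a question the paper's experiments address; the tag set here is only a placeholder.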