Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
Yasmine Karoui, Rémi Lebret, Negar Foroutan Eghlidi, Karl Aberer
Main: Multilingualism and Cross-Lingual NLP Main-poster Paper
Poster Session 7: Multilingualism and Cross-Lingual NLP (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 12, Poster Session 7 (15:00-16:30 UTC)
Keywords:
cross-lingual transfer
Abstract:
Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning.
The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLMs) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLMs.
We utilize a cross-lingual contextualized token embedding alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target-language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.
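
The core idea described above, training a target-language student encoder by aligning its contextualized token embeddings to those of the English VLP text encoder over machine-translated sentence pairs, can be sketched roughly as below. This is a hypothetical illustration, not the authors' implementation (see the linked repository): the model names, the frozen-teacher setup, and the toy alignment indices are all assumptions, and the word alignments would in practice come from an off-the-shelf aligner.

```python
# Minimal sketch of cross-lingual token-embedding alignment (illustrative only).
# Assumptions: "bert-base-uncased" stands in for the VLP model's English text
# encoder (the frozen teacher); "bert-base-multilingual-cased" is the trainable
# student; both happen to share hidden size 768 so MSE applies directly.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

teacher = AutoModel.from_pretrained("bert-base-uncased")
student = AutoModel.from_pretrained("bert-base-multilingual-cased")
tok_en = AutoTokenizer.from_pretrained("bert-base-uncased")
tok_ml = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

teacher.eval()
for p in teacher.parameters():  # the English encoder is kept frozen
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

def alignment_loss(en_sent, tgt_sent, aligned_pairs):
    """MSE between contextualized embeddings at aligned token positions.

    aligned_pairs: list of (i, j) where token i of the English sentence is
    aligned with token j of its machine translation (e.g. produced by an
    off-the-shelf word aligner).
    """
    en_inputs = tok_en(en_sent, return_tensors="pt")
    tgt_inputs = tok_ml(tgt_sent, return_tensors="pt")
    with torch.no_grad():
        h_teacher = teacher(**en_inputs).last_hidden_state[0]   # (len_en, d)
    h_student = student(**tgt_inputs).last_hidden_state[0]      # (len_tgt, d)
    src = torch.tensor([i for i, _ in aligned_pairs])
    tgt = torch.tensor([j for _, j in aligned_pairs])
    return F.mse_loss(h_student[tgt], h_teacher[src])

# One illustrative training step on a machine-translated pair; the alignment
# indices here are toy values (position 0 is [CLS] in both tokenizations).
optimizer.zero_grad()
loss = alignment_loss("a dog runs", "ein Hund rennt",
                      aligned_pairs=[(1, 1), (2, 2), (3, 3)])
loss.backward()
optimizer.step()
```

Note that no images are involved at any point: once the student's token embeddings are aligned with the teacher's, the student can be swapped in as the text encoder of the vision-language model, which is what lets the approach sidestep target-language image-text data.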