Visually-Enhanced Phrase Understanding

Tsu-Yuan Hsu; Chen-An Li; Chao-Wei Huang; Yun-Nung Chen

Visually-Enhanced Phrase Understanding

Tsu-Yuan Hsu, Chen-An Li, Chao-Wei Huang, Yun-Nung Chen

📝 Paper

Anthology

Underline 🪧 Poster 🧑‍🏫 Slides 📺 Watch Video on Underline Add to Favorites

Findings: Speech and Multimodality Findings Paper

Session 1: Speech and Multimodality (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 10, Session 1 (15:00-16:30 UTC)

Spotlight Session: Spotlight - Metropolitan Centre (Spotlight)

Conference Room: Metropolitan Centre

Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)

Global Time: July 10, Spotlight Session (23:00-01:00 UTC)

Keywords: multimodality

TLDR: Large-scale vision-language pre-training has exhibited strong performance in various visual and textual understanding tasks. Recently, the textual encoders of multi-modal pre-trained models have been shown to generate high-quality textual representations, which often outperform models that are purel...

You can open the #paper-P5193 channel in a separate window.

Abstract: Large-scale vision-language pre-training has exhibited strong performance in various visual and textual understanding tasks. Recently, the textual encoders of multi-modal pre-trained models have been shown to generate high-quality textual representations, which often outperform models that are purely text-based, such as BERT. In this study, our objective is to utilize both textual and visual encoders of multi-modal pre-trained models to enhance language understanding tasks. We achieve this by generating an image associated with a textual prompt, thus enriching the representation of a phrase for downstream tasks. Results from experiments conducted on four benchmark datasets demonstrate that our proposed method, which leverages visually-enhanced text representations, significantly improves performance in the entity clustering task.