GPL at SemEval-2023 Task 1: WordNet and CLIP to Disambiguate Images

Shibingfeng Zhang, Shantanu Nath, Davide Mazzaccara

The 17th International Workshop on Semantic Evaluation (SemEval-2023) Task-1 - visual word sense disambiguation (visual-wsd) Paper

TLDR: Given a word in context, the task of Visual Word Sense Disambiguation consists of selecting the correct image among a set of candidates. To select the correct image, we propose a solution blending text augmentation and multimodal models. Text augmentation leverages the fine-grained semantic annotati
Abstract: Given a word in context, the task of Visual Word Sense Disambiguation consists of selecting the correct image among a set of candidates. To select the correct image, we propose a solution blending text augmentation and multimodal models. Text augmentation leverages the fine-grained semantic annotation from WordNet to get a better representation of the textual component. We then compare this sense-augmented text to the set of images using the pre-trained multimodal models CLIP and ViLT. Our system was ranked 16th for the English language, achieving 68.5 points for hit rate and 79.2 for mean reciprocal rank.
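
To make the ranking step and the two reported metrics concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the sense-augmented text and each candidate image have already been encoded into vectors by a multimodal model such as CLIP; the toy vectors, the cosine-similarity scoring, and the function names `rank_candidates` and `hit_rate_and_mrr` are illustrative assumptions. Hit rate is taken here as the fraction of instances whose gold image is ranked first, and mean reciprocal rank as the average of 1/rank of the gold image; the abstract's figures would correspond to these values multiplied by 100.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(text_vec, image_vecs):
    """Return candidate indices sorted by decreasing similarity to the text."""
    scores = [cosine(text_vec, v) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


def hit_rate_and_mrr(rankings, gold):
    """Hit rate (gold image ranked first) and mean reciprocal rank."""
    hits = 0
    reciprocal_sum = 0.0
    for ranking, gold_index in zip(rankings, gold):
        rank = ranking.index(gold_index) + 1  # 1-based rank of the gold image
        hits += rank == 1
        reciprocal_sum += 1.0 / rank
    n = len(gold)
    return hits / n, reciprocal_sum / n


# Hypothetical embeddings: one text vector and three candidate images per instance.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
image_sets = [
    [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]],
    [[0.6, 0.8], [0.8, 0.6], [1.0, 0.0]],
]
gold = [0, 1]  # index of the correct image in each candidate set

rankings = [rank_candidates(t, imgs) for t, imgs in zip(text_vecs, image_sets)]
hit_rate, mrr = hit_rate_and_mrr(rankings, gold)
print(f"hit rate = {hit_rate:.2f}, MRR = {mrr:.2f}")
```

In this toy data the gold image is ranked first for the first instance and second for the second, so the mean reciprocal rank exceeds the hit rate, the same relationship seen between the reported 79.2 and 68.5.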