Tackling Modality Heterogeneity with Multi-View Calibration Network for Multimodal Sentiment Detection

Yiwei Wei; Shaozu Yuan; Ruosong Yang; Lei Shen; Zhangmeizhi Li; Longbiao Wang; Meng Chen

Main: Language Grounding to Vision, Robotics, and Beyond Main-poster Paper

Session 4: Language Grounding to Vision, Robotics, and Beyond (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 11, Session 4 (15:00-16:30 UTC)

Keywords: cross-modal application

Abstract: With the popularity of social media, detecting sentiment from multimodal posts (e.g., image-text pairs) has attracted substantial attention recently. Existing works mainly focus on fusing different features but ignore the challenge of modality heterogeneity. Specifically, the inherent disparities between modalities may cause three problems: 1) introducing redundant visual features during feature fusion; 2) causing feature shift in the representation space; 3) leading to inconsistent annotations for data from different modalities. All of these issues make it harder to understand the sentiment of multimodal content. In this paper, we propose a novel Multi-View Calibration Network (MVCN) to alleviate these issues systematically. We first propose a text-guided fusion module with a novel Sparse-Attention mechanism to reduce the negative impact of redundant visual elements. We then devise a sentiment-based congruity constraint task to calibrate the feature shift in the representation space. Finally, we introduce an adaptive loss calibration strategy to tackle inconsistently annotated labels. Extensive experiments demonstrate that MVCN is competitive with previous approaches and achieves state-of-the-art results on two public benchmark datasets.
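
The abstract does not say how Sparse-Attention is defined, so the following is only an illustrative sketch of the general idea of text-guided sparse fusion, not the MVCN implementation. It assumes top-k masking as the sparsification rule: each text token attends only to its k highest-scoring image regions, and all other regions receive zero weight. The function name, argument names, and shapes are hypothetical.

```python
import numpy as np

def topk_sparse_cross_attention(text_q, image_k, image_v, k):
    """Text-guided cross-attention that keeps only the top-k image regions per token.

    Assumption: top-k masking stands in for the paper's Sparse-Attention,
    whose exact definition is not given in the abstract.

    text_q:  (T, d) queries derived from text tokens
    image_k: (R, d) keys derived from image regions
    image_v: (R, d) values derived from image regions
    Returns (fused features of shape (T, d), attention weights of shape (T, R)).
    """
    d = text_q.shape[-1]
    scores = text_q @ image_k.T / np.sqrt(d)  # scaled dot-product scores, (T, R)

    # The k-th largest score in each row is the cut-off threshold.
    kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]

    # Regions below the threshold are masked out (ties at the threshold are kept).
    masked = np.where(scores >= kth, scores, -np.inf)

    # Numerically stable softmax over the surviving regions; exp(-inf) is 0.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ image_v, weights
```

Called with, say, 3 text tokens, 5 image regions, and `k=2`, each row of the returned weight matrix sums to 1 and has non-zero entries only for the two regions that scored highest for that token. Hard top-k masking is just one way to sparsify attention; other rules (for example, thresholding or sparsemax) would fit the same description.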