Layer-wise Fusion with Modality Independence Modeling for Multi-modal Emotion Recognition

Jun Sun; Shoukang Han; Yu-Ping Ruan; Xiaoning Zhang; Shu-Kai Zheng; Yulong Liu; Yuxin Huang; Taihao Li

Layer-wise Fusion with Modality Independence Modeling for Multi-modal Emotion Recognition

Jun Sun, Shoukang Han, Yu-Ping Ruan, Xiaoning Zhang, Shu-Kai Zheng, Yulong Liu, Yuxin Huang, Taihao Li

📝 Paper

Anthology

Underline 🪧 Poster 📺 Watch Video on Underline Add to Favorites

Main: Speech and Multimodality Main-poster Paper

Poster Session 3: Speech and Multimodality (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)

Global Time: July 11, Poster Session 3 (13:00-14:30 UTC)

Keywords: multimodality

TLDR: Multi-modal emotion recognition has gained increasing attention in recent years due to its widespread applications and the advances in multi-modal learning approaches. However, previous studies primarily focus on developing models that exploit the unification of multiple modalities. In this paper, w...

You can open the #paper-P2776 channel in a separate window.

Abstract: Multi-modal emotion recognition has gained increasing attention in recent years due to its widespread applications and the advances in multi-modal learning approaches. However, previous studies primarily focus on developing models that exploit the unification of multiple modalities. In this paper, we propose that maintaining modality independence is beneficial for the model performance. According to this principle, we construct a dataset, and devise a multi-modal transformer model. The new dataset, CHinese Emotion Recognition dataset with Modality-wise Annotions, abbreviated as CHERMA, provides uni-modal labels for each individual modality, and multi-modal labels for all modalities jointly observed. The model consists of uni-modal transformer modules that learn representations for each modality, and a multi-modal transformer module that fuses all modalities. All the modules are supervised by their corresponding labels separately, and the forward information flow is uni-directionally from the uni-modal modules to the multi-modal module. The supervision strategy and the model architecture guarantee each individual modality learns its representation independently, and meanwhile the multi-modal module aggregates all information. Extensive empirical results demonstrate that our proposed scheme outperforms state-of-the-art alternatives, corroborating the importance of modality independence in multi-modal emotion recognition. The dataset and codes are availabel at {https://github.com/sunjunaimer/LFMIM}