A Facial Expression-Aware Multimodal Multi-task Learning Framework for Emotion Recognition in Multi-party Conversations

Wenjie Zheng; Jianfei Yu; Rui Xia; Shijin Wang

A Facial Expression-Aware Multimodal Multi-task Learning Framework for Emotion Recognition in Multi-party Conversations

Wenjie Zheng, Jianfei Yu, Rui Xia, Shijin Wang

📝 Paper

Anthology

Underline 🪧 Poster 🧑‍🏫 Slides 📺 Watch Video on Underline Add to Favorites

Main: Speech and Multimodality Main-poster Paper

Poster Session 4: Speech and Multimodality (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 11, Poster Session 4 (15:00-16:30 UTC)

Keywords: video processing, multimodality

TLDR: Multimodal Emotion Recognition in Multiparty Conversations (MERMC) has recently attracted considerable attention. Due to the complexity of visual scenes in multi-party conversations, most previous MERMC studies mainly focus on text and audio modalities while ignoring visual information. Recently, se...

You can open the #paper-P4505 channel in a separate window.

Abstract: Multimodal Emotion Recognition in Multiparty Conversations (MERMC) has recently attracted considerable attention. Due to the complexity of visual scenes in multi-party conversations, most previous MERMC studies mainly focus on text and audio modalities while ignoring visual information. Recently, several works proposed to extract face sequences as visual features and have shown the importance of visual information in MERMC. However, given an utterance, the face sequence extracted by previous methods may contain multiple people's faces, which will inevitably introduce noise to the emotion prediction of the real speaker. To tackle this issue, we propose a two-stage framework named Facial expressionaware Multimodal Multi-Task learning (FacialMMT). Specifically, a pipeline method is first designed to extract the face sequence of the real speaker of each utterance, which consists of multimodal face recognition, unsupervised face clustering, and face matching. With the extracted face sequences, we propose a multimodal facial expression-aware emotion recognition model, which leverages the frame-level facial emotion distributions to help improve utterance-level emotion recognition based on multi-task learning. Experiments demonstrate the effectiveness of the proposed FacialMMT framework on the benchmark MELD dataset. The source code is publicly released at https://github.com/NUSTM/FacialMMT.