MeetingQA: Extractive Question-Answering on Meeting Transcripts

Archiki Prasad; Trung Bui; Seunghyun Yoon; Hanieh Deilamsalehy; Franck Dernoncourt; Mohit Bansal

Track: Main Conference, Question Answering (Poster Paper)

Poster Session 1: Question Answering (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 10, Poster Session 1 (15:00-16:30 UTC)

Keywords: conversational QA

Abstract: With the ubiquitous use of online meeting platforms and robust automatic speech recognition systems, meeting transcripts have emerged as a promising domain for natural language tasks. Most recent work on meeting transcripts focuses primarily on summarization and the extraction of action items. However, meeting discussions also have a useful question-answering (QA) component that is crucial to understanding the discourse and meeting content, and that can be used to build interactive interfaces on top of long transcripts. Hence, in this work, we leverage this inherent QA component of meeting discussions and introduce MeetingQA, an extractive QA dataset consisting of questions asked by meeting participants and the corresponding responses. As a result, questions can be open-ended and actively seek discussion, while answers can be multi-span and distributed across multiple speakers. Our comprehensive empirical study of several robust baselines, including long-context language models and recent instruction-tuned models, reveals that models perform poorly on this task (F1 = 57.3) and lag severely behind human performance (F1 = 84.6), presenting a challenging new task for the community to improve upon.
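The abstract reports model and human performance as F1 on answers that may span several non-contiguous utterances. As a rough illustration only, the sketch below computes a SQuAD-style token-overlap F1 for multi-span answers by joining all spans on each side and comparing bags of normalized tokens. This joining scheme, the normalization rules, and the names `normalize_tokens` and `multi_span_f1` are assumptions made for illustration; they are not the MeetingQA authors' evaluation script. The function returns a score in [0, 1], whereas the abstract quotes scores on a 0-100 scale.

```python
import re
import string
from collections import Counter


def normalize_tokens(text: str) -> list[str]:
    """SQuAD-style normalization (assumed): lowercase, drop punctuation and
    English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def multi_span_f1(pred_spans: list[str], gold_spans: list[str]) -> float:
    """Hypothetical token-overlap F1 for multi-span answers: all spans on each
    side are concatenated and compared as bags of tokens."""
    pred = normalize_tokens(" ".join(pred_spans))
    gold = normalize_tokens(" ".join(gold_spans))
    if not pred and not gold:
        # Both empty, e.g. an unanswered question correctly predicted as such.
        return 1.0
    if not pred or not gold:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up example: the gold answer spans two speakers' utterances.
    gold = ["we should move the launch to May", "I agree, May works"]
    pred = ["move the launch to May"]
    print(round(multi_span_f1(pred, gold), 3))
```

In this made-up example the prediction recovers only the first speaker's span, so precision is 1.0 but recall is 0.4, giving F1 = 4/7 (about 0.571). This shows how missing part of a multi-speaker answer lowers the score even when every predicted token is correct.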