A Synthetic Data Generation Framework for Grounded Dialogues

Jianzhu Bao; Rui Wang; Yasheng Wang; Aixin Sun; Yitong Li; Fei Mi; Ruifeng Xu

A Synthetic Data Generation Framework for Grounded Dialogues

Jianzhu Bao, Rui Wang, Yasheng Wang, Aixin Sun, Yitong Li, Fei Mi, Ruifeng Xu

📝 Paper

Anthology

Underline 🪧 Poster 📺 Watch Video on Underline Add to Favorites

Main: Dialogue and Interactive Systems Main-poster Paper

Session 1: Dialogue and Interactive Systems (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 10, Session 1 (15:00-16:30 UTC)

Keywords: grounded dialog

TLDR: Training grounded response generation models often requires a large collection of grounded dialogues. However, it is costly to build such dialogues. In this paper, we present a synthetic data generation framework (SynDG) for grounded dialogues. The generation process utilizes large pre-trained langu...

You can open the #paper-P2766 channel in a separate window.

Abstract: Training grounded response generation models often requires a large collection of grounded dialogues. However, it is costly to build such dialogues. In this paper, we present a synthetic data generation framework (SynDG) for grounded dialogues. The generation process utilizes large pre-trained language models and freely available knowledge data (e.g., Wikipedia pages, persona profiles, etc.). The key idea of designing SynDG is to consider dialogue flow and coherence in the generation process. Specifically, given knowledge data, we first heuristically determine a dialogue flow, which is a series of knowledge pieces. Then, we employ T5 to incrementally turn the dialogue flow into a dialogue. To ensure coherence of both the dialogue flow and the synthetic dialogue, we design a two-level filtering strategy, at the flow-level and the utterance-level respectively. Experiments on two public benchmarks show that the synthetic grounded dialogue data produced by our framework is able to significantly boost model performance in both full training data and low-resource scenarios.