With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness

Julius Steen, Juri Opitz, Anette Frank, Katja Markert

Main: Generation Main-poster Paper

Poster Session 3: Generation (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)
Global Time: July 11, Poster Session 3 (13:00-14:30 UTC)
Keywords: automatic evaluation
TLDR: Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metr...
You can open the #paper-P3656 channel in a separate window.
Abstract: Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a cartesian product of input and generated sentences, or supporting them with a question-generation/answering step. In this work we show that pure NLI models \_can\_ outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) Augmenting NLI training data to adapt NL inferences to the specificities of faithfulness prediction in dialogue; (2) Making use of both entailment and contradiction probabilities in NLI, and (3) Using Monte-Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.