With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Julius Steen, Juri Opitz, Anette Frank, Katja Markert
Main Conference: Generation (Poster Paper)
Poster Session 3: Generation (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)
Global Time: July 11, Poster Session 3 (13:00-14:30 UTC)
Keywords:
automatic evaluation
Abstract:
Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a Cartesian product of input and generated sentences, or by supporting them with a question-generation/answering step.

In this work we show that pure NLI models can outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) augmenting NLI training data to adapt NL inferences to the specificities of faithfulness prediction in dialogue; (2) making use of both entailment and contradiction probabilities in NLI; and (3) using Monte-Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.
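To illustrate points (2) and (3), here is a minimal sketch of an NLI-based faithfulness scorer that uses both entailment and contradiction probabilities and averages over Monte-Carlo dropout samples. The checkpoint name, the entailment-minus-contradiction combination, and the sample count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint for illustration; the paper does not name this model here.
MODEL_NAME = "microsoft/deberta-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Look up label indices from the model config instead of hardcoding them.
label2id = {label.lower(): i for i, label in model.config.id2label.items()}
ENTAIL, CONTRA = label2id["entailment"], label2id["contradiction"]

def faithfulness_score(source: str, generated: str, n_samples: int = 10) -> float:
    """Score how well `generated` is supported by `source`.

    Combines entailment and contradiction probabilities and averages over
    Monte-Carlo dropout samples (an illustrative scoring function).
    """
    inputs = tokenizer(source, generated, return_tensors="pt", truncation=True)

    model.train()  # keep dropout layers active so each forward pass is a sample
    scores = []
    with torch.no_grad():
        for _ in range(n_samples):
            probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
            # One plausible combination: reward entailment, penalize contradiction.
            scores.append((probs[ENTAIL] - probs[CONTRA]).item())
    model.eval()

    return sum(scores) / len(scores)

print(faithfulness_score("The cat sat on the mat.", "A cat is on a mat."))
```

Keeping the model in train mode is the standard way to activate dropout at inference time: each forward pass then yields a slightly different probability estimate, and averaging these samples tends to give a more robust score than a single deterministic pass.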