WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning

Wenhao Wu; Wei Li; Xinyan Xiao; Jiachen Liu; Sujian Li; Yajuan Lyu

Main: Generation Main-poster Paper

Poster Session 2: Generation (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 10, 14:00-15:30 (EDT) (America/Toronto)

Global Time: July 10, Poster Session 2 (18:00-19:30 UTC)

Keywords: automatic evaluation

Abstract: A crucial issue with current text generation models is that they often uncontrollably generate text that is factually inconsistent with their inputs. Because annotated data is scarce, existing factual consistency metrics usually train evaluation models on synthetic texts or transfer directly from related tasks, such as question answering (QA) and natural language inference (NLI). Biases in the synthetic text or the upstream tasks cause these metrics to perform poorly on text actually generated by language models, especially when a single metric is used for general evaluation across various tasks. To alleviate this problem, we propose a weakly supervised framework named WeCheck that is trained directly on actual generated samples from language models with weakly annotated labels. WeCheck first uses a generative model to infer the factual labels of generated samples by aggregating weak labels from multiple resources. Next, we train a simple noise-aware classification model as the target metric using the inferred weakly supervised information. Comprehensive experiments on various tasks demonstrate the strong performance of WeCheck, which achieves an average absolute improvement of 3.3% on the TRUE benchmark over state-of-the-art methods with 11B parameters while using only 435M parameters. Furthermore, it is up to 30 times faster than previous evaluation methods, greatly improving both the accuracy and the efficiency of factual consistency evaluation.
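The two-stage idea in the abstract (aggregate weak labels into an inferred label, then train on it in a noise-aware way) can be illustrated with a minimal Python sketch. The abstract does not specify the generative labeling model or the training objective, so everything here is an assumption made for illustration: the naive Bayes aggregation, the per-source accuracies, and the names `aggregate_weak_labels` and `soft_cross_entropy` are hypothetical and are not WeCheck's actual implementation.

```python
import math


def aggregate_weak_labels(votes, accuracies, prior=0.5):
    """Return P(consistent | votes) under a naive Bayes assumption.

    votes: list of 0/1 weak labels, one per source (None = source abstains).
    accuracies: assumed probability that each source is correct (hypothetical).
    prior: assumed prior probability that a sample is factually consistent.
    """
    log_pos = math.log(prior)
    log_neg = math.log(1.0 - prior)
    for vote, acc in zip(votes, accuracies):
        if vote is None:
            continue  # an abstaining source contributes no evidence
        log_pos += math.log(acc if vote == 1 else 1.0 - acc)
        log_neg += math.log(acc if vote == 0 else 1.0 - acc)
    # Normalize in log space for numerical stability.
    shift = max(log_pos, log_neg)
    pos = math.exp(log_pos - shift)
    neg = math.exp(log_neg - shift)
    return pos / (pos + neg)


def soft_cross_entropy(pred, target, eps=1e-12):
    """Binary cross-entropy against a soft (probabilistic) target.

    Training on the aggregated probability instead of a hard 0/1 label is one
    simple way to keep label uncertainty in the loss.
    """
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


# Three hypothetical weak sources vote on one generated sample.
soft_label = aggregate_weak_labels([1, 1, 0], [0.8, 0.7, 0.6])
loss = soft_cross_entropy(0.9, soft_label)
```

Here the soft label plays the role of the inferred weak supervision, and the loss is what a downstream classifier would minimize; a real system would replace both pieces with the models described in the paper.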