Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

Jan Deriu; Pius von Däniken; Don Tuggener; Mark Cieliebak

Findings: Resources and Evaluation Findings Paper

Session 4: Resources and Evaluation (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 11, Session 4 (15:00-16:30 UTC)

Spotlight Session: Spotlight - Metropolitan East (Spotlight)

Conference Room: Metropolitan East

Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)

Global Time: July 10, Spotlight Session (23:00-01:00 UTC)

Keywords: evaluation methodologies, evaluation, statistical testing for evaluation

Abstract: A major challenge in the field of Text Generation is evaluation: human evaluations are cost-intensive, and automated metrics often disagree considerably with human judgments. In this paper, we propose to apply automated metrics for Text Generation in a preference-based evaluation protocol. The protocol features a statistical model that incorporates various levels of uncertainty to account for the error-proneness of the metrics. We show that existing metrics are generally over-confident in assigning significant differences between systems. As a remedy, the model allows human ratings to be combined with automated ratings. We show that it can reduce the number of human ratings required to arrive at robust and statistically significant results by more than 50%, while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of the evaluation protocol for three text generation tasks: dialogue systems, machine translation, and text summarization.
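
To make the idea of a preference-based protocol concrete, here is a minimal sketch of the naive baseline that the abstract argues against: metric scores for two systems on the same inputs are turned into preference ratings, and a two-sided sign test is applied as if those ratings were error-free. This is not the statistical model proposed in the paper, which additionally accounts for the metric's errors. The function names, the `tie_margin` parameter, and the example scores are all assumptions made for illustration.

```python
from math import comb


def preference_ratings(scores_a, scores_b, tie_margin=0.0):
    """Turn paired metric scores into preference ratings.

    Returns a list with 1 where system A is preferred, -1 where
    system B is preferred, and 0 for a tie. `tie_margin` is a
    hypothetical threshold below which a difference counts as a tie.
    """
    ratings = []
    for a, b in zip(scores_a, scores_b):
        diff = a - b
        if abs(diff) <= tie_margin:
            ratings.append(0)
        else:
            ratings.append(1 if diff > 0 else -1)
    return ratings


def sign_test_p_value(ratings):
    """Two-sided exact sign test on preference ratings, ignoring ties.

    Naive on purpose: every rating is treated as a correct judgment,
    so errors made by the metric are not modelled at all.
    """
    wins = sum(1 for r in ratings if r == 1)
    losses = sum(1 for r in ratings if r == -1)
    n = wins + losses
    if n == 0:
        return 1.0  # only ties: no evidence for either system
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up metric scores for two systems on six shared inputs.
ratings = preference_ratings(
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
)
p_value = sign_test_p_value(ratings)
```

Because the test treats each metric preference as if it were a reliable human judgment, a metric that often disagrees with humans can still produce a small p-value. That is the kind of over-confidence the abstract describes, and it is what the proposed model is meant to address by incorporating uncertainty about the metric's errors.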