Common Law Annotations: Investigating the Stability of Dialog System Output Annotations

Seunggun Lee, Alexandra DeLucia, Nikita Nangia, Praneeth Ganedi, Ryan Guan, Rubing Li, Britney Ngaw, Aditya Singhal, Shalaka Vaidya, Zijun Yuan, Lining Zhang, João Sedoc

The 17th Linguistic Annotation Workshop (LAW-XVII) @ ACL 2023

TLDR: High inter-annotator agreement (e.g., Cohen's Kappa) shows reliability but does not guarantee validity or reproducibility. Using the standardized LEAP annotation protocol on a dialog dataset, the authors show that guidelines tailored to raise agreement within a group can produce divergent annotations across research groups.
Abstract: Metrics for Inter-Annotator Agreement (IAA), like Cohen's Kappa, are crucial for validating annotated datasets. Although high agreement is often used to show the reliability of annotation procedures, it is insufficient to ensure validity or reproducibility. While researchers are encouraged to increase annotator agreement, this can lead to specific and tailored annotation guidelines. We hypothesize that this may result in diverging annotations from different groups. To study this, we first propose the Lee et al. Protocol (LEAP), a standardized and codified annotation protocol. LEAP strictly enforces transparency in the annotation process, which ensures reproducibility of annotation guidelines. Using LEAP to annotate a dialog dataset, we empirically show that while research groups may create reliable guidelines by raising agreement, this can cause divergent annotations across different research groups, thus questioning the validity of the annotations. Therefore, we caution NLP researchers against using reliability as a proxy for reproducibility and validity.
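To make the role of Cohen's Kappa concrete, here is a minimal sketch, not taken from the paper, that computes the statistic for two hypothetical annotators labeling dialog system output; the annotator names and labels are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement from each annotator's marginal label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n)
              for c in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-response labels from two annotators (illustrative data only).
annotator_a = ["good", "bad", "good", "good", "bad", "good", "bad", "bad"]
annotator_b = ["good", "bad", "good", "bad", "bad", "good", "good", "bad"]
print(f"Cohen's Kappa: {cohens_kappa(annotator_a, annotator_b):.2f}")  # 0.50
```

The point of the paper is that a high value of this statistic within one research group indicates reliability of its guidelines, but says nothing about whether a different group following its own (equally reliable) guidelines would produce the same annotations.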