Benchmarking Offensive and Abusive Language in Dutch Tweets

Tommaso Caselli, Hylke Van Der Veen

The 7th Workshop on Online Abuse and Harms (WOAH) Long paper Paper

TLDR: We present an extensive evaluation of different fine-tuned models to detect instances of offensive and abusive language in Dutch across three benchmarks: a standard held-out test, a task- agnostic functional benchmark, and a dynamic test set. We also investigate the use of data cartography to identi
You can open the #paper-ACL_26 channel in a separate window.
Abstract: We present an extensive evaluation of different fine-tuned models to detect instances of offensive and abusive language in Dutch across three benchmarks: a standard held-out test, a task- agnostic functional benchmark, and a dynamic test set. We also investigate the use of data cartography to identify high quality training data. Our results show a relatively good quality of the manually annotated data used to train the models while highlighting some critical weakness. We have also found a good portability of trained models along the same language phenomena. As for the data cartography, we have found a positive impact only on the functional benchmark and when selecting data per annotated dimension rather than using the entire training material.