Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features

Ester Hlavnova; Sebastian Ruder

Main: Multilingualism and Cross-Lingual NLP Main-oral Paper

Session 1: Multilingualism and Cross-Lingual NLP (Oral)

Conference Room: Pier 4&5

Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 10, Session 1 (15:00-16:30 UTC)

Keywords: multilingualism, linguistic variation, multilingual benchmarks, multilingual evaluation

Languages: French, Spanish, Italian, Russian, Slovak, Arabic, Finnish, Swedish, Swahili, Mandarin Chinese, German

Abstract: A challenge towards developing NLP systems for the world's languages is understanding how they generalize to typological differences relevant for real-world applications. To this end, we propose M2C, a morphologically-aware framework for behavioral testing of NLP models. We use M2C to generate tests that probe models' behavior in light of specific linguistic features in 12 typologically diverse languages. We evaluate state-of-the-art language models on the generated tests. While models excel at most tests in English, we highlight generalization failures to specific typological characteristics such as temporal expressions in Swahili and compounding possessives in Finnish. Our findings motivate the development of models that address these blind spots.