Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya

Fitsum Gaim, Wonsuk Yang, Hancheol Park, Jong Park

Main: Linguistic Diversity Main-oral Paper

Session 3: Linguistic Diversity (Oral)
Conference Room: Pier 7&8
Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)
Global Time: July 11, Session 3 (13:00-14:30 UTC)
Keywords: less-resourced languages
Languages: Tigrinya, East African, Afro-Asiatic, Semitic
TLDR: Question-Answering (QA) has seen significant advances recently, achieving near human-level performance over some benchmarks. However, these advances focus on high-resourced languages such as English, while the task remains unexplored for most other languages, mainly due to the lack of annotated data...
Abstract: Question-Answering (QA) has seen significant advances recently, achieving near human-level performance over some benchmarks. However, these advances focus on high-resourced languages such as English, while the task remains unexplored for most other languages, mainly due to the lack of annotated datasets. This work presents a native QA dataset for an East African language, Tigrinya. The dataset contains 10.6K question-answer pairs spanning 572 paragraphs extracted from 290 news articles on various topics. The dataset construction method is discussed, which is applicable to constructing similar resources for related languages. We present comprehensive experiments and analyses of several resource-efficient approaches to QA, including monolingual, cross-lingual, and multilingual setups, along with comparisons against machine-translated silver data. Our strong baseline models reach 76% in the F1 score, while the estimated human performance is 92%, indicating that the benchmark presents a good challenge for future work. We make the dataset, models, and leaderboard publicly available.