SERENGETI: Massively Multilingual Language Models for Africa

Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

Findings: Linguistic Diversity Findings Paper

Session 7: Linguistic Diversity (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 12, Session 7 (15:00-16:30 UTC)
Spotlight Session: Spotlight - Metropolitan West (Spotlight)
Conference Room: Metropolitan West
Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)
Global Time: July 10, Spotlight Session (23:00-01:00 UTC)
Keywords: less-resourced languages, endangered languages, indigenous languages, minoritized languages, language documentation, resources for less-resourced languages, software and tools
Languages: 517 African languages and language varieties
TLDR: Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this ...
Abstract: Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a set of massively multilingual language models that cover 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving an average F_1 of 82.27. We also perform analyses of errors from our models, which allow us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research. {Anonymous link}