MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

Cheikh M. Bamba Dione, David Ifeoluwa Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Andrew Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing K. Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson K. Kalipe, Tebogo Macucwa, Vukosi Marivate, Tajuddeen Gwadabe, Mboning Tchiaze Elvis, Ikechukwu Ekene Onyenwe, Gratien G. Atindogbe, Tolulope Anu Adelani, Idris Akinade, Olanrewaju Samuel, Marien Nahimana, Théogène Musabeyezu, Emile Niyomutabazi, Ester Chimhenga, Kudzai Gotosa, Patrick Mizha, Apelete Agbolo, Seydou Traore, Chinedu Uchechukwu, Aliyu Yakubu Yusuf, Muhammad Sulaiman Abdullahi, Dietrich Klakow

Main Track: Linguistic Diversity (Oral Paper)

Session 3: Linguistic Diversity (Oral)
Conference Room: Pier 7&8
Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)
Global Time: July 11, Session 3 (13:00-14:30 UTC)
Keywords: part-of-speech tagging, low-resource language POS tagging, parsing and related tasks
Languages: African languages
TLDR: In this paper, we present AfricaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both conditional random fields and several multilingual pre-trained language models.
Abstract: In this paper, we present AfricaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both conditional random fields and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the AfricaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with parameter fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems to be more effective for POS tagging in unseen languages.
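
Below is a minimal, self-contained sketch (not the authors' code) of the kind of baseline described in the abstract: fine-tuning a multilingual pre-trained encoder for UPOS token classification with the Hugging Face Trainer. The model name (xlm-roberta-base), the toy training sentences, and all hyperparameters are illustrative assumptions; in the paper's setting the training data would come from source UD treebanks (single- or multi-source transfer) and evaluation from the AfricaPOS test sets.

```python
# Illustrative sketch only: fine-tune a multilingual encoder for UPOS token classification.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
label2id = {t: i for i, t in enumerate(UPOS)}
id2label = {i: t for t, i in label2id.items()}

# Toy UD-style data; in practice this would be a source UD treebank chosen for transfer.
toy = Dataset.from_dict({
    "tokens": [["Mary", "eats", "rice", "."], ["The", "dog", "sleeps", "."]],
    "upos":   [["PROPN", "VERB", "NOUN", "PUNCT"], ["DET", "NOUN", "VERB", "PUNCT"]],
})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(UPOS), id2label=id2label, label2id=label2id)

def tokenize_and_align(batch):
    # Align word-level UPOS tags with subword tokens produced by the tokenizer.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["upos"]):
        prev, labels = None, []
        for w in enc.word_ids(batch_index=i):
            # Special tokens and subword continuations are ignored by the loss (-100).
            labels.append(-100 if w is None or w == prev else label2id[tags[w]])
            prev = w
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

train = toy.map(tokenize_and_align, batched=True, remove_columns=toy.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="pos-xlmr", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
).train()
```

The same fine-tuned model can then be applied zero-shot to a target-language test set, which is where the choice of source treebank(s) matters most according to the abstract's findings.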