Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

Sina Ahmadi, Antonios Anastasopoulos

Main: Linguistic Diversity Main-oral Paper

Session 3: Linguistic Diversity (Oral)
Conference Room: Pier 7&8
Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)
Global Time: July 11, Session 3 (13:00-14:30 UTC)
Keywords: less-resourced languages
Languages: Azeri Turkish, Mazanderani, Gilaki, Sindhi, Kashmiri, Central Kurdish, Northern Kurdish, Gorani, Persian, Arabic, Urdu
Abstract: The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, particularly where the speakers of a language in a bilingual community rely on another script or orthography to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated. We also conduct a small-scale evaluation on real data. Our experiments further indicate that script normalization improves performance on downstream tasks such as machine translation and language identification.
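
The idea of synthetic data with various levels of noise can be illustrated with a minimal sketch. It is not the authors' pipeline: the character map below is a small hypothetical example (a few Central Kurdish letters paired with visually similar Persian letters), and the names `ILLUSTRATIVE_MAP`, `add_script_noise`, and `make_pairs` are assumptions introduced here. Each mappable character is swapped for its dominant-script counterpart with probability `noise_level`, which yields (noisy, clean) pairs of the kind a sequence-to-sequence normalizer could be trained on.

```python
import random

# Hypothetical, illustrative map: Central Kurdish letter -> similar Persian letter.
# The abstract does not specify the actual mappings; these are assumptions.
ILLUSTRATIVE_MAP = {
    "\u0695": "\u0631",  # ڕ -> ر
    "\u06b5": "\u0644",  # ڵ -> ل
    "\u06c6": "\u0648",  # ۆ -> و
    "\u06ce": "\u06cc",  # ێ -> ی
    "\u06d5": "\u0647",  # ە -> ه
}


def add_script_noise(text, char_map, noise_level, seed=None):
    """Replace each mappable character with probability `noise_level`."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be between 0 and 1")
    rng = random.Random(seed)  # local generator keeps results reproducible
    out = []
    for ch in text:
        if ch in char_map and rng.random() < noise_level:
            out.append(char_map[ch])  # write the dominant-script character
        else:
            out.append(ch)  # keep the conventional character
    return "".join(out)


def make_pairs(sentences, char_map, noise_level, seed=0):
    """Build (noisy, clean) training pairs, one per input sentence."""
    return [
        (add_script_noise(s, char_map, noise_level, seed=seed + i), s)
        for i, s in enumerate(sentences)
    ]
```

Calling `make_pairs` on the same clean sentences with several values of `noise_level` (for example 0.2, 0.6, and 1.0) produces datasets of increasing difficulty. A one-to-one character substitution is the simplest possible noise model; real unconventional writing may also involve insertions, deletions, and many-to-one mappings that this sketch does not cover.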