Hints on the data for language modeling of synthetic languages with transformers

Rodolfo Joel Zevallos, Nuria Bel

Main: Resources and Evaluation Main-poster Paper

Poster Session 4: Resources and Evaluation (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 11, Poster Session 4 (15:00-16:30 UTC)
Keywords: evaluation
Languages: german, french, turkish, quechua
TLDR: Language Models (LM) are becoming more and more useful for providing representations upon which to train Natural Language Processing applications. However, there is now clear evidence that attention-based transformers require a critical amount of language data to produce good enough LMs. The questi...
You can open the #paper-P2079 channel in a separate window.
Abstract: Language Models (LM) are becoming more and more useful for providing representations upon which to train Natural Language Processing applications. However, there is now clear evidence that attention-based transformers require a critical amount of language data to produce good enough LMs. The question we have addressed in this paper is to what extent the critical amount of data varies for languages of different morphological typology, in particular those that have a rich inflectional morphology, and whether the tokenization method to preprocess the data can make a difference. These details can be important for low-resourced languages that need to plan the production of datasets. We evaluated intrinsically and extrinsically the differences of five different languages with different pretraining dataset sizes and three different tokenization methods for each. The results confirm that the size of the vocabulary due to morphological characteristics is directly correlated with both the LM perplexity and the performance of two typical downstream tasks such as NER identification and POS labeling. The experiments also provide new evidence that a canonical tokenizer can reduce perplexity by more than a half for a polysynthetic language like Quechua as well as raising F1 from 0.8 to more than 0.9 in both downstream tasks with a LM trained with only 6M tokens.