AraMUS: Pushing the Limits of Data and Model Scale for Arabic Natural Language Processing

Asaad Alghamdi, Xinyu Duan, Wei Jiang, Zhenhai Wang, Yimeng Wu, Qingrong Xia, Zhefeng Wang, Yi Zheng, Mehdi Rezagholizadeh, Baoxing Huai, Peilun Cheng, Abbas Ghaddar

Findings: Large Language Models Findings Paper

Session 4: Large Language Models (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 11, Session 4 (15:00-16:30 UTC)
Keywords: pre-training
Languages: Arabic
Abstract: Developing monolingual large Pre-trained Language Models (PLMs) has proven very successful in handling a wide range of tasks in Natural Language Processing (NLP). In this work, we present AraMUS, the largest Arabic PLM, with 11B parameters trained on 529GB of high-quality Arabic textual data. AraMUS achieves state-of-the-art performance on a diverse set of Arabic classification and generative tasks. Moreover, AraMUS shows impressive few-shot learning abilities compared with the best existing Arabic PLMs.