Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

Arnav Anil Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy, Anoop Kunchukuttan

Main: Resources and Evaluation Main-poster Paper

Poster Session 2: Resources and Evaluation (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 10, 14:00-15:30 (EDT) (America/Toronto)
Global Time: July 10, Poster Session 2 (18:00-19:30 UTC)
Keywords: corpus creation, benchmarking, language resources, automatic creation and evaluation of language resources, nlp datasets, datasets for low resource languages
Languages: hindi, bengali, assamese, gujarati, kannada, malayalam, marathi, tamil, telugu, punjabi, odia
TLDR: We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (P...
Abstract: We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than 80 for 7 out of 9 test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam.
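
Illustrative sketch: the abstract describes projecting automatically tagged entities from an English sentence onto its Indian language translation. The Python below shows one simple way such a projection could work, assuming BIO tags on the English side and a word alignment supplied by some external aligner. The function name `project_entities`, the alignment format, and the rule of labelling the contiguous target range covered by the aligned tokens are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' actual procedure.

```python
from collections import defaultdict


def project_entities(src_tags, alignment, tgt_len):
    """Project BIO entity tags from source tokens onto target tokens.

    src_tags:  BIO tags for the English tokens, e.g. ["B-PER", "I-PER", "O"].
    alignment: (src_index, tgt_index) pairs from a word aligner (assumed given).
    tgt_len:   number of tokens in the target-language sentence.
    """
    # Group source tokens into entity spans: (label, [source indices]).
    spans = []
    for i, tag in enumerate(src_tags):
        if tag.startswith("B-"):
            spans.append((tag[2:], [i]))
        elif (tag.startswith("I-") and spans
              and spans[-1][0] == tag[2:] and spans[-1][1][-1] == i - 1):
            spans[-1][1].append(i)

    # Map each source index to the target indices it is aligned with.
    src_to_tgt = defaultdict(list)
    for s, t in alignment:
        src_to_tgt[s].append(t)

    tgt_tags = ["O"] * tgt_len
    for label, src_idxs in spans:
        tgt_idxs = sorted({t for s in src_idxs for t in src_to_tgt[s]})
        if not tgt_idxs:
            continue  # no aligned target tokens: drop this entity
        start, end = tgt_idxs[0], tgt_idxs[-1]
        if any(tgt_tags[j] != "O" for j in range(start, end + 1)):
            continue  # would overlap an already projected entity: drop it
        tgt_tags[start] = "B-" + label
        for j in range(start + 1, end + 1):
            tgt_tags[j] = "I-" + label
    return tgt_tags


# Hypothetical example sentence pair and hand-written alignment.
english_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]   # Sachin Tendulkar lives in Mumbai
hindi_tokens = ["सचिन", "तेंदुलकर", "मुंबई", "में", "रहते", "हैं"]
alignment = [(0, 0), (1, 1), (4, 2), (3, 3), (2, 4)]

print(project_entities(english_tags, alignment, len(hindi_tokens)))
# ['B-PER', 'I-PER', 'B-LOC', 'O', 'O', 'O']
```

In this sketch, entities with no aligned target tokens, or whose projected range would collide with an earlier entity, are simply dropped; a real pipeline would likely need more careful filtering of noisy alignments.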