IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages
Ananya Sai B, Tanay Dixit, Vignesh Nagarajan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre
Main: Resources and Evaluation Main-poster Paper
    Poster Session 7: Resources and Evaluation (Poster)
    
Conference Room: Frontenac Ballroom and Queen's Quay 
    Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
    Global Time: July 12, Poster Session 7 (15:00-16:30 UTC)
    
    
  
          Keywords:
          corpus creation, evaluation, datasets for low resource languages, metrics
        
        
        
        
          Languages:
          hindi, gujarati, marathi, tamil, malayalam
        
        
        
        
  
  
  
    
            Abstract:
            The rapid growth of machine translation (MT) systems necessitates meta-evaluations of evaluation metrics to enable selection of those that best reflect MT quality. Unfortunately, most meta-evaluation studies focus on European languages, and their observations may not always apply to other languages. Indian languages, which have over a billion speakers, are linguistically different from European languages, and to date there are no such systematic studies focused solely on English-to-Indian-language MT. This paper fills this gap through a Multidimensional Quality Metrics (MQM) dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems. We evaluate 16 metrics and show that pre-trained metrics like COMET have the highest correlations with annotator scores, as opposed to n-gram metrics like BLEU. We further leverage our MQM annotations to develop an Indic-COMET metric and show that it outperforms its COMET counterparts in both correlation with human scores and robustness scores on Indian languages. Additionally, we show that Indic-COMET can outperform COMET on some unseen Indian languages. We hope that our dataset and analysis will facilitate further research in Indic MT evaluation.
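
            As a rough illustration of what meta-evaluating a metric involves, the sketch below correlates per-segment metric scores with human scores. It is a minimal sketch under stated assumptions, not the paper's procedure: the abstract does not say which correlation statistics were used, so Pearson and Kendall tau are assumed here, and the score lists and the names `metric_a` and `metric_b` are made-up toy values, not results from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical toy scores for five translated segments (NOT from the paper).
human = [0.9, 0.4, 0.7, 0.2, 0.6]          # assumed human quality scores
metric_a = [0.85, 0.35, 0.75, 0.30, 0.55]  # assumed metric that tracks humans
metric_b = [0.50, 0.45, 0.40, 0.55, 0.60]  # assumed metric that does not

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, round(pearson(human, scores), 3), round(kendall_tau(human, scores), 3))
```

            A metric whose scores rank segments the way annotators do gets a correlation near 1, which is the sense in which one metric "best reflects MT quality" relative to another.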
          