xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Mingda Chen; Kevin Heffernan; Onur Çelebi; Alexandre Mourachko; Holger Schwenk

Main: Machine Translation Main-oral Paper

Session 2: Machine Translation (Oral)

Conference Room: Metropolitan West

Conference Time: July 10, 14:00-15:30 (EDT) (America/Toronto)

Global Time: July 10, Session 2 (18:00-19:30 UTC)

Keywords: multilingual mt

Abstract: We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xsim++. In comparison to xsim, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. We show that xsim++ correlates better than xsim with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xsim++ also reports performance for different error types, offering more fine-grained feedback for model development.
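
The idea of a similarity-based proxy score can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain cosine similarity as the scoring function (a simplification), and the function names `cosine` and `error_rate`, the toy two-dimensional vectors, and the single hand-made hard negative are all hypothetical. The score is the fraction of source sentences whose nearest candidate in the embedding space is not the aligned target. Adding synthetic hard negatives to the candidate pool, as xsim++ does for English sentences, can only keep that error rate the same or raise it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def error_rate(src_embs, tgt_embs, extra_tgt_embs=()):
    """Fraction of source sentences whose nearest candidate is not the gold target.

    Assumption: src_embs[i] is aligned with tgt_embs[i]. extra_tgt_embs holds
    embeddings of synthetic hard negatives that are never a correct match.
    """
    candidates = list(tgt_embs) + list(extra_tgt_embs)
    errors = 0
    for i, src in enumerate(src_embs):
        # Index of the candidate most similar to this source sentence.
        best = max(range(len(candidates)), key=lambda j: cosine(src, candidates[j]))
        if best != i:
            errors += 1
    return errors / len(src_embs)


# Toy embeddings, made up for illustration only (not results from the paper).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[1.0, 0.05]]  # stands in for a perturbed copy of the first target

plain_score = error_rate(src, tgt)                      # no hard negatives
augmented_score = error_rate(src, tgt, hard_negatives)  # with hard negatives
print(plain_score, augmented_score)
```

In this toy setup the plain score sees no errors, while the augmented score exposes one: the hard negative sits closer to the first source vector than its true target does. This is the kind of weakness that an evaluation set without hard negatives would hide.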