InfoSync: Information Synchronization across Multilingual Semi-structured Tables

Siddharth Hemant Khincha; Chelsi Jain; Vivek Gupta; Tushar Kataria; Shuo Zhang

Findings: Resources and Evaluation Findings Paper

Session 1: Resources and Evaluation (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 10, Session 1 (15:00-16:30 UTC)

Spotlight Session: Spotlight - Metropolitan East (Spotlight)

Conference Room: Metropolitan East

Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)

Global Time: July 10, Spotlight Session (23:00-01:00 UTC)

Keywords: corpus creation, multilingual corpora, nlp datasets, datasets for low resource languages

Abstract: Information synchronization of semi-structured data across languages is challenging. For example, Wikipedia tables in one language need to be synchronized with their counterparts in other languages. To address this problem, we introduce a new dataset, InfoSync, and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (~3.5K pairs) is manually annotated. The proposed method includes 1) Information Alignment, which maps rows across multilingual tables, and 2) Information Update, which fills in missing or outdated information in the aligned tables. When evaluated on InfoSync, information alignment achieves an F1 score of 87.91 (en <-> non-en). To evaluate information update, we perform human-assisted Wikipedia edits on Infoboxes for 532 table pairs. Our approach obtains an acceptance rate of 77.28% on Wikipedia, showing the effectiveness of the proposed method.
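
To make the two-step idea concrete, here is a minimal toy sketch in Python. It is not the authors' method: the abstract does not say how rows are aligned or how updates are generated, so this sketch assumes a hand-written key dictionary (`key_map`, hypothetical) for alignment and simply copies values for rows missing in the target table. The function names, the example tables, and the F1 helper are all illustrative assumptions.

```python
from typing import Dict, Set, Tuple

Table = Dict[str, str]            # row key -> row value
Alignment = Set[Tuple[str, str]]  # (source key, target key) pairs


def align_rows(src: Table, tgt: Table, key_map: Dict[str, str]) -> Alignment:
    """Step 1 (toy): align a source row with a target row when the
    hypothetical dictionary key_map translates its key to a key in tgt."""
    return {(k, key_map[k]) for k in src if key_map.get(k) in tgt}


def propose_updates(src: Table, tgt: Table, key_map: Dict[str, str]) -> Table:
    """Step 2 (toy): propose rows for the target table whose translated key
    is absent there. Values are copied as-is; a real system would also need
    to translate them and to detect outdated values."""
    return {
        key_map[k]: v
        for k, v in src.items()
        if k in key_map and key_map[k] not in tgt
    }


def alignment_f1(pred: Alignment, gold: Alignment) -> float:
    """F1 between predicted and gold row alignments."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example tables (English and French infobox rows).
en = {"Born": "1950", "Occupation": "Writer", "Spouse": "A. Example"}
fr = {"Naissance": "1950", "Profession": "Writer"}
key_map = {"Born": "Naissance", "Occupation": "Profession", "Spouse": "Conjoint"}

aligned = align_rows(en, fr, key_map)
updates = propose_updates(en, fr, key_map)
```

In this toy setup, `aligned` holds the two row pairs that exist in both tables, and `updates` holds the one row ("Conjoint") that the French table lacks. Scoring `aligned` against a gold alignment with `alignment_f1` mirrors, in spirit only, how an alignment F1 score such as the one reported above could be computed.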