Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation
Soyoung Yoon, Sungjoon Park, Gyuwan Kim, Junhee Cho, Kihyo Park, Gyu Tae Kim, Minjoon Seo, Alice Oh
Main: Resources and Evaluation Main-poster Paper
Poster Session 3: Resources and Evaluation (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)
Global Time: July 11, Poster Session 3 (13:00-14:30 UTC)
Keywords:
datasets for low resource languages
Languages:
Korean
Abstract:
Research on Korean grammatical error correction (GEC) is limited compared to that on other major languages such as English. We attribute this to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to build an evaluation benchmark for Korean, and present baseline models trained on our datasets. We show that the model trained on our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.