Training for Grammatical Error Correction Without Human-Annotated L2 Learners' Corpora

Mikio Oda

18th Workshop on Innovative Use of NLP for Building Educational Applications Paper


Abstract: Grammatical error correction (GEC) is a challenging task both for non-native second language (L2) learners and for machine learning systems. Data-driven GEC requires as much genuine human-annotated training data as possible. However, human-annotated data is difficult to produce at larger scale, which makes synthetically generated large-scale parallel training data valuable for GEC systems. In this paper, we propose a method for rebuilding a corpus of synthetic parallel data using target sentences predicted by a GEC model to improve performance. Experimental results show that pre-training on the rebuilt corpus outperforms pre-training on the original synthetic datasets. Moreover, our proposed training without human-annotated L2 learners' corpora is shown to be as practical, in terms of accuracy, as conventional full pipeline training with both synthetic datasets and L2 learners' corpora.
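
The corpus-rebuilding idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that the GEC model is applied to each synthetic source sentence and that its prediction replaces the original target. The names `rebuild_corpus` and `toy_gec_model` are hypothetical, and the toy model is a rule-based stand-in for a trained GEC system.

```python
from typing import Callable, Iterable, List, Tuple

# A parallel pair: (erroneous source sentence, corrected target sentence).
Pair = Tuple[str, str]


def rebuild_corpus(pairs: Iterable[Pair],
                   correct: Callable[[str], str]) -> List[Pair]:
    """Replace each original target with the model's prediction for the source.

    Assumption: the model input is the synthetic source sentence; the
    abstract does not spell out this detail.
    """
    rebuilt = []
    for source, _original_target in pairs:
        rebuilt.append((source, correct(source)))
    return rebuilt


def toy_gec_model(sentence: str) -> str:
    """Hypothetical stand-in for a trained GEC model (simple string rules)."""
    fixes = {"goed": "went", "a apple": "an apple"}
    for wrong, right in fixes.items():
        sentence = sentence.replace(wrong, right)
    return sentence


# Tiny illustrative synthetic corpus (invented examples, not from the paper).
synthetic_pairs = [
    ("She goed to school .", "She went to school ."),
    ("I ate a apple .", "I ate an apple ."),
]

rebuilt_pairs = rebuild_corpus(synthetic_pairs, toy_gec_model)
for source, target in rebuilt_pairs:
    print(f"{source}\t{target}")
```

In this sketch the source sentences are kept unchanged and only the target side is regenerated; the rebuilt pairs would then be used for pre-training in place of the original synthetic pairs.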