Preserving the Authenticity of Handwritten Learner Language: Annotation Guidelines for Creating Transcripts Retaining Orthographic Features

Christian Gold, Ronja Laarmann-Quante, Torsten Zesch

The Workshop on Computation and Written Language (CAWL) Paper

Abstract: Handwritten texts produced by young learners often contain orthographic features such as spelling errors, capitalization errors, and punctuation mistakes, as well as impurities such as strikethroughs, insertions, and smudges, all of which are typically normalized or ignored in existing transcriptions. For applications such as handwriting recognition that aim to automatically analyze a learner's language performance, however, these features need to be retained. To address this, we present transcription guidelines that retain the features described above. Our guidelines were developed iteratively and include numerous example images to illustrate the various issues. On a subset of about 90 double-transcribed texts, we compute inter-annotator agreement and show that our guidelines can be applied reliably, with a percentage agreement of about .98. Overall, we transcribed 1,350 learner texts, which is comparable in size to the widely adopted handwriting recognition datasets IAM (1,500 pages) and CVL (1,600 pages). Our final corpus can be used to train a handwriting recognition system whose transcriptions stay close to what young learners actually produced. Such a system is a prerequisite for applying automatic orthography feedback systems to handwritten texts in the future.
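To make the agreement figure concrete, the following is a minimal sketch of how a percentage agreement between two transcripts of the same text could be computed. The abstract does not specify the unit of comparison or the alignment method, so the character-level alignment via Python's `difflib`, the function name `percentage_agreement`, and the example strings are all assumptions made for illustration, not the authors' procedure.

```python
from difflib import SequenceMatcher


def percentage_agreement(transcript_a: str, transcript_b: str) -> float:
    """Share of characters on which two transcripts agree (0.0 to 1.0).

    Assumption: agreement is measured at character level after aligning
    the two transcripts; the paper may use a different unit or method.
    """
    if not transcript_a and not transcript_b:
        return 1.0  # two empty transcripts agree trivially

    # autojunk=False keeps frequent characters (e.g. spaces) in the alignment.
    matcher = SequenceMatcher(None, transcript_a, transcript_b, autojunk=False)

    # Sum the lengths of all aligned, identical character runs.
    matched = sum(block.size for block in matcher.get_matching_blocks())

    # Normalize by the longer transcript so extra characters count as disagreement.
    return matched / max(len(transcript_a), len(transcript_b))


# Hypothetical example: the annotators disagree only on one capital letter.
annotator_1 = "the dog runs"
annotator_2 = "the Dog runs"
score = percentage_agreement(annotator_1, annotator_2)
print(f"{score:.3f}")
```

Because capitalization is one of the features the guidelines aim to retain, the comparison here is deliberately case-sensitive; a normalizing comparison would hide exactly the disagreements of interest.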