[SRW] How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese

Takuro Fujii, Koki Shibata, Atsuki Yamaguchi, Terufumi Morishita, Yasuhiro Sogawa

Student Research Workshop (SRW) Paper

Session 7: Student Research Workshop (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 12, Session 7 (15:00-16:30 UTC)
TLDR: We pretrain Japanese language models with many combinations of morphological analyzers and subword tokenizers and find that the optimal morphological analyzer differs across downstream tasks, while Byte-Pair-Encoding and Unigram consistently outperform WordPiece as the subword tokenizer.
Abstract: This paper investigates the effect of tokenizers on the downstream performance of pretrained language models (PLMs) in scriptio continua languages where no explicit spaces exist between words, using Japanese as a case study. The tokenizer for such languages often consists of a morphological analyzer and a subword tokenizer, requiring us to conduct a comprehensive study of all possible pairs. However, previous studies lack this comprehensiveness. We therefore train extensive sets of tokenizers, build a PLM using each, and measure the downstream performance on a wide range of tasks. Our results demonstrate that each downstream task has a different optimal morphological analyzer, and that it is better to use Byte-Pair-Encoding or Unigram rather than WordPiece as a subword tokenizer, regardless of the type of task.
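A minimal sketch of the two-stage tokenizer pipeline the abstract describes (morphological analyzer followed by a subword tokenizer), not the paper's actual setup: it assumes fugashi (a MeCab wrapper) with a UniDic dictionary and the sentencepiece library are installed, and the tiny in-memory corpus and toy Unigram model below are purely illustrative stand-ins.

```python
# Sketch of a Japanese tokenizer built from a morphological analyzer plus a
# subword tokenizer, as discussed in the abstract. Assumptions (not from the
# paper): fugashi + unidic-lite for morphological analysis, sentencepiece for
# the subword model, and a toy two-sentence corpus for self-containment.
import io

import sentencepiece as spm
from fugashi import Tagger  # MeCab-based morphological analyzer

# Stage 1: morphological analysis inserts word boundaries into raw Japanese,
# which has no explicit spaces between words (scriptio continua).
tagger = Tagger()

def pre_segment(text: str) -> str:
    """Split unsegmented Japanese text into space-separated morphemes."""
    return " ".join(word.surface for word in tagger(text))

corpus = [
    "吾輩は猫である。名前はまだ無い。",
    "国境の長いトンネルを抜けると雪国であった。",
]
segmented = [pre_segment(line) for line in corpus]

# Stage 2: train a (toy) Unigram subword model on the pre-segmented text.
# BPE could be plugged in here instead via model_type="bpe".
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(segmented),
    model_writer=model,
    vocab_size=100,
    model_type="unigram",
    hard_vocab_limit=False,  # the toy corpus cannot fill a real vocabulary
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

# Full pipeline: raw text -> morphemes -> subword tokens.
print(sp.encode(pre_segment("吾輩は猫である。"), out_type=str))
```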