[Industry] CWSeg: An Efficient and General Approach to Chinese Word Segmentation

Dedong Li, Rui Zhao, Fei Tan

Industry: Industry Paper

Session 5: Industry (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 11, 16:15-17:45 (EDT) (America/Toronto)
Global Time: July 11, Session 5 (20:15-21:45 UTC)
Abstract: In this work, we report our efforts to advance Chinese Word Segmentation for rapid deployment in different applications. Segmentation methods based on pre-trained language models (PLMs) have achieved state-of-the-art (SOTA) performance, but this paradigm also poses challenges in deployment. These include balancing performance against cost, resolving segmentation ambiguity caused by domain diversity and vague word boundaries, and supporting multi-grained segmentation. In this context, we propose a simple yet effective approach, namely CWSeg, which augments PLM-based schemes with cohort training and versatile decoding strategies. Extensive experiments on benchmark datasets demonstrate the efficiency and generalization of our approach. We have also implemented the corresponding segmentation system for practical use and recorded a demo.