Chinese-DiMLex: A Lexicon of Chinese Discourse Connectives

Shujun Wan, Peter Bourgonje, Hongling Xiao, Clara Wan Ching Ho, Manfred Stede

4th Workshop on Computational Approaches to Discourse Extended abstract Paper

Abstract: Machine-readable inventories of discourse connectives that provide information on multiple levels are valuable resources for automated discourse analysis, e.g. discourse parsing, machine translation, text summarization, and argumentation mining. Connective lexicons already exist for several languages (such as German, English, French, Czech, Portuguese, Hebrew, and Spanish), but no such resource is currently available for Chinese, despite it being one of the most widely spoken languages in the world. To address this gap, we developed Chinese-DiMLex, a discourse lexicon for Chinese (Mandarin). It features 137 Chinese connectives and is augmented with five layers of information: morphological variations, syntactic categories (part of speech), semantic relations (PDTB 3.0 sense inventory), usage examples, and English translations. Chinese-DiMLex is publicly accessible both in XML format and through an easy-to-use web interface, which enables browsing and searching the lexicon as well as comparing discourse connectives across languages based on their syntactic and semantic properties. In this extended abstract, we provide an overview of the data and the workflow used to populate the lexicon, followed by a discussion of several Chinese-specific considerations and issues that arose during the process. By submitting this abstract, we aim to a) contribute to discourse research and b) receive feedback to promote and expand the lexicon in future work.
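
To make the five information layers concrete, the following is a minimal sketch of how an XML entry of this kind might be read and queried in Python. The element and attribute names (`entry`, `variant`, `pos`, `sense`, `example`, `translation`) and the sample entry itself are assumptions made up for illustration; they are not taken from the published Chinese-DiMLex schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: tag names, attributes, and content are illustrative
# assumptions, NOT the actual Chinese-DiMLex schema or data.
SAMPLE = """
<lexicon>
  <entry id="1" word="因为">
    <variants><variant>因</variant></variants>
    <pos>conjunction</pos>
    <senses><sense>Contingency.Cause.Reason</sense></senses>
    <example>因为下雨，比赛取消了。</example>
    <translation lang="en">because</translation>
  </entry>
</lexicon>
"""


def load_entries(xml_text):
    """Parse the (assumed) XML layout into one dict per connective."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "word": entry.get("word"),
            "variants": [v.text for v in entry.iter("variant")],
            "pos": entry.findtext("pos"),
            "senses": [s.text for s in entry.iter("sense")],
            "example": entry.findtext("example"),
            "translation": entry.findtext("translation"),
        })
    return entries


def by_sense(entries, prefix):
    """Return the connectives having at least one sense that starts with prefix."""
    return [
        e["word"]
        for e in entries
        if any(sense.startswith(prefix) for sense in e["senses"])
    ]


entries = load_entries(SAMPLE)
print(by_sense(entries, "Contingency"))
```

Filtering on a sense prefix reflects the hierarchical form of PDTB 3.0 sense labels, so the same query could select a whole sense class or one specific relation, depending on how much of the label is given.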