A Mutual Information-based Approach to Quantifying Logography in Japanese and Sumerian

Noah Hermalin

The Workshop on Computation and Written Language (CAWL) Paper

TLDR: Writing systems have traditionally been classified by whether they prioritize encoding phonological information (phonographic) versus morphological or semantic information (logographic). Recent work has broached the question of how membership in these categories can be quantified.
Abstract: Writing systems have traditionally been classified by whether they prioritize encoding phonological information (phonographic) versus morphological or semantic information (logographic). Recent work has broached the question of how membership in these categories can be quantified. Sproat and Gutkin (2021) proposed a range of metrics for quantifying degree of logography, including mutual information and a metric based on the contextual attention required by a sequence-to-sequence RNN that maps pronunciations to spellings. We aim to build on this work by adopting a definition of logography which, in contrast to the definition used by Sproat and Gutkin, more directly incorporates morphological identity. We compare the mutual information between graphic forms and phonological forms with the mutual information between graphic forms and morphological identity for written Japanese and Sumerian. Our results suggest that our methods present a promising means of classifying the degree to which a writing system is logographic or phonographic.
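To make the comparison in the abstract concrete, the following is a minimal sketch of a plug-in (relative-frequency) mutual information estimate over paired observations. It is an illustration only, not the paper's actual pipeline: the function name `mutual_information`, the toy `tokens` list, and its abstract labels (`A`, `ka`, `m1`, and so on) are all hypothetical, and the paper may use a different estimator, corpus format, or normalization.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each (x, y) pair
    px = Counter(x for x, _ in pairs)      # marginal counts of x
    py = Counter(y for _, y in pairs)      # marginal counts of y
    # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written in terms of counts
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical toy corpus: (graphic form, phonological form, morpheme id).
# Each graph here always writes the same morpheme but has two readings.
tokens = [
    ("A", "ka", "m1"),
    ("A", "ki", "m1"),
    ("B", "ka", "m2"),
    ("B", "ki", "m2"),
]

mi_graph_phon = mutual_information([(g, p) for g, p, _ in tokens])
mi_graph_morph = mutual_information([(g, m) for g, _, m in tokens])

print(f"I(graph; phonology)  = {mi_graph_phon:.3f} bits")
print(f"I(graph; morphology) = {mi_graph_morph:.3f} bits")
```

In this toy setup the graphic form carries more information about morphological identity than about pronunciation, which is the pattern one would loosely associate with the logographic end of the scale. On real corpora a plug-in estimate is biased for sparse counts, so some correction or normalization would likely be needed before comparing scripts.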