Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering
Logan Born, M. Willis Monroe, Kathryn Kelley, Anoop Sarkar
The Workshop on Computation and Written Language (CAWL) Paper
Abstract:
A crucial step in deciphering a text is to identify what set of characters were used to write it. This requires grouping character tokens according to visual and contextual features, which can be challenging for human analysts when the number of tokens or underlying types is large. Prior work has shown that this process can be automated by clustering dense representations of character images, in a task which we call "script clustering". In this work, we present novel architectures which exploit varying degrees of contextual and visual information to learn representations for use in script clustering. We evaluate on a range of modern and ancient scripts, and find that our models produce representations which are more effective for script recovery than the current state-of-the-art, despite using just ~2% as many parameters. Our analysis fruitfully applies these models to assess hypotheses about the character inventory of the partially-deciphered proto-Elamite script.
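As a rough illustration of what clustering dense representations of character tokens means, the sketch below runs plain k-means over a handful of made-up 2-D vectors. It is not the architecture proposed in the paper: the embeddings, the function names (`kmeans`, `squared_distance`, `mean`), the choice of k-means, and the fixed initial centroids are all assumptions made for the example. In practice the vectors would come from a learned encoder over character images and their context.

```python
# Toy sketch of script clustering: group token embeddings into candidate
# character types with plain k-means. All data and names are hypothetical;
# this is NOT the model described in the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(embeddings, initial_centroids, iterations=10):
    """Return (assignments, centroids) after a fixed number of k-means steps."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each token goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(e, centroids[k]))
            for e in embeddings
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [e for e, a in zip(embeddings, assignments) if a == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = mean(members)
    return assignments, centroids

# Hypothetical embeddings for six character tokens (two underlying types).
token_embeddings = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]

assignments, centroids = kmeans(
    token_embeddings,
    initial_centroids=[token_embeddings[0], token_embeddings[3]],
)

# The number of distinct clusters is the estimated character inventory size.
inventory_size = len(set(assignments))
print(assignments, inventory_size)
```

Each resulting cluster stands for one hypothesised character type, so the set of clusters is an estimate of the script's inventory. The number of clusters is fixed in advance here, which is a simplification: for an undeciphered script the true number of types is not known.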