Should you marginalize over possible tokenizations?

Nadezhda Chirkova; Germán Kruszewski; Jos Rozen; Marc Dymetman

Main: Large Language Models Main-poster Paper

Poster Session 7: Large Language Models (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 12, Poster Session 7 (15:00-16:30 UTC)

Keywords: interpretability/analysis

Abstract: Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
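To make the quantities in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: the vocabulary `TOY_PROBS`, the product-of-unigrams `score` function standing in for a language model, the longest-match `greedy_tokenization` standing in for a default tokenizer, and the uniform-per-step proposal are all assumptions chosen for illustration. It compares the score of a single default tokenization with the exact marginal over all tokenizations, and with an importance-sampling estimate of that marginal.

```python
import math
import random

# Hypothetical toy vocabulary with made-up token probabilities (assumption).
TOY_PROBS = {"a": 0.1, "b": 0.1, "c": 0.1, "ab": 0.2, "bc": 0.2, "abc": 0.3}


def score(tokens):
    """Toy stand-in for an LM: product of per-token probabilities."""
    return math.prod(TOY_PROBS[t] for t in tokens)


def tokenizations(text):
    """Yield every way to split `text` into vocabulary tokens."""
    if not text:
        yield []
        return
    for tok in TOY_PROBS:
        if text.startswith(tok):
            for rest in tokenizations(text[len(tok):]):
                yield [tok] + rest


def exact_marginal(text):
    """Sum the score over all tokenizations (feasible only for tiny inputs)."""
    return sum(score(toks) for toks in tokenizations(text))


def greedy_tokenization(text):
    """Longest-match-first split, standing in for a default tokenizer."""
    tokens = []
    while text:
        tok = max((t for t in TOY_PROBS if text.startswith(t)), key=len)
        tokens.append(tok)
        text = text[len(tok):]
    return tokens


def sample_tokenization(text, rng):
    """Draw a split by picking uniformly among matching tokens at each step.

    Returns the tokens and the proposal probability q of that split.
    Every single character is in the vocabulary, so no choice dead-ends.
    """
    tokens, q = [], 1.0
    while text:
        choices = [t for t in TOY_PROBS if text.startswith(t)]
        tok = rng.choice(choices)
        q /= len(choices)
        tokens.append(tok)
        text = text[len(tok):]
    return tokens, q


def importance_estimate(text, n_samples, seed=0):
    """Average of score/q over sampled splits: an unbiased marginal estimate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        tokens, q = sample_tokenization(text, rng)
        total += score(tokens) / q
    return total / n_samples
```

For the string "abc" in this toy setup, the default tokenization is the single token "abc" with score 0.3, while the exact marginal over the four possible splits is about 0.341; `importance_estimate("abc", 20000)` should land close to the latter. Enumeration grows exponentially with string length, which is why a sampling-based estimate is needed for real models.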