Tokenization, the practice of converting strings of characters over an alphabet into sequences of tokens over a vocabulary, is a critical yet under-theorized step in the NLP pipeline. Notably, it remains the only major step not fully integrated into widely used end-to-end neural models. This paper aims to address this theoretical gap by laying the foundations of tokenization from a formal perspective. By articulating and extending basic properties of the category of stochastic maps, we propose a unified framework for representing and analyzing tokenizer models. This framework allows us to establish general conditions for the use of tokenizers. In particular, we formally establish the necessary and sufficient conditions for a tokenizer model to preserve the consistency of statistical estimators. Additionally, we discuss statistical and computational concerns crucial for the design and implementation of tokenizer models. The framework and results advanced in this paper represent a step toward a robust theoretical foundation for neural language modeling.

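This paper aims to lay the foundations of tokenization from a formal perspective. By articulating and extending basic properties of the category of stochastic maps, we propose a unified framework for representing and analyzing tokenizer models, and we discuss the statistical and computational concerns that are essential to designing and implementing them. This work is a step toward a robust theoretical foundation for neural language modeling.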

The Foundations of Tokenization: Statistical and Computational Concerns