BriefGPT.xyz
Jul, 2024
The Foundations of Tokenization: Statistical and Computational Concerns
Juan Luis Gastaldi, John Terilla, Luca Malagutti, Brian DuSell, Tim Vieira...
TL;DR
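This paper aims to lay the foundations of tokenization from a formal perspective. By articulating and extending basic properties of the category of stochastic maps, the authors propose a unified framework for representing and analyzing tokenizer models, and discuss statistical and computational concerns that are essential to designing and implementing such models. The work is a step toward a robust theoretical foundation for neural language modeling.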
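To make the notion of a tokenizer model concrete, here is a minimal Python sketch that treats a tokenizer as a pair of maps: an encoder from character strings to token sequences and a decoder back to strings. It is not the paper's framework. The toy vocabulary `VOCAB`, the greedy longest-match rule, and the function names `encode` and `decode` are all assumptions chosen for illustration. Both maps here are deterministic, which can be read as the special case of a stochastic map that puts all probability on a single output.

```python
from typing import Dict, List

# Hypothetical toy vocabulary over the alphabet {a, b}: token string -> token id.
VOCAB: Dict[str, int] = {"a": 0, "b": 1, "ab": 2, "ba": 3, "abb": 4}
ID_TO_TOKEN: Dict[int, str] = {i: t for t, i in VOCAB.items()}
MAX_TOKEN_LEN = max(len(t) for t in VOCAB)


def encode(text: str) -> List[int]:
    """Encode a string by greedy longest-match against VOCAB (an assumed rule)."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(MAX_TOKEN_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            # No token matches at this position: the character is outside the alphabet.
            raise ValueError(f"cannot tokenize character {text[pos]!r} at position {pos}")
    return ids


def decode(ids: List[int]) -> str:
    """Decode a token-id sequence by concatenating the token strings."""
    return "".join(ID_TO_TOKEN[i] for i in ids)
```

In this sketch, decoding after encoding returns the original string for every string over the alphabet, while distinct token sequences such as `[0, 1]` and `[2]` decode to the same string. Which of these sequences the encoder chooses is the kind of design decision that a formal treatment of tokenizers has to account for.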
Abstract
Tokenization, the practice of converting strings of characters over an alphabet into sequences of tokens over a vocabulary, is a critical yet under-theorized step in the NLP pipeline. Notably, it remains the on