Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs) because it handles subword units robustly and avoids out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, which are rich in semantic information, occur less often in tokenized datasets than short tokens, which can lead to imbalanced learning across tokens. To address this, we propose LBPE, which prioritizes long tokens during the encoding process. LBPE generates tokens according to their reverse rank by token length rather than their rank in the vocabulary, so longer tokens receive higher priority during encoding. Consequently, LBPE narrows the frequency gap between short and long tokens and thus mitigates the learning imbalance. Extensive experiments across diverse language modeling tasks show that LBPE consistently outperforms the original BPE.
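The difference between the two encoding orders can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: the function name `encode`, the `ranks` table, and the choice to break length ties by vocabulary rank are all hypothetical, and a real tokenizer would work on bytes with a learned merge table.

```python
from typing import Dict, List


def encode(word: str, ranks: Dict[str, int], long_first: bool) -> List[str]:
    """Greedily merge adjacent pieces of `word` into vocabulary tokens.

    `ranks` (hypothetical toy table) maps each multi-character token to its
    vocabulary rank; a lower rank means the token was learned earlier.
    With long_first=False the lowest-rank merge wins (BPE-style order).
    With long_first=True the longest merged token wins, and vocabulary
    rank is used only to break ties (an assumption made for this sketch).
    """
    parts = list(word)
    while len(parts) > 1:
        best_key = None
        best_idx = -1
        for i in range(len(parts) - 1):
            merged = parts[i] + parts[i + 1]
            if merged not in ranks:
                continue
            if long_first:
                key = (-len(merged), ranks[merged])
            else:
                key = (ranks[merged],)
            if best_key is None or key < best_key:
                best_key, best_idx = key, i
        if best_idx < 0:
            break  # no adjacent pair forms a known token
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return parts


# Toy vocabulary: "abcd" is deliberately absent so the two orders diverge.
toy_ranks = {"ab": 0, "cd": 1, "abc": 2}

bpe_tokens = encode("abcd", toy_ranks, long_first=False)
lbpe_tokens = encode("abcd", toy_ranks, long_first=True)
print(bpe_tokens)   # ['ab', 'cd']
print(lbpe_tokens)  # ['abc', 'd']
```

In this toy case the rank-ordered encoder never emits the longer token `abc`, because the earlier-learned `cd` consumes the `c` first, while the length-ordered encoder does emit it. That is the kind of frequency shift toward long tokens the abstract describes.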

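This work addresses the learning imbalance in large language models caused by the low frequency of long tokens. The proposed LBPE method prioritizes long tokens during encoding, thereby balancing the frequency difference between short and long tokens. Experimental results show that LBPE outperforms conventional Byte Pair Encoding (BPE) across a variety of language modeling tasks, demonstrating its effectiveness.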

LBPE: Long-token-first Tokenization to Improve Large Language Models