In \textit{Tokenization and the Noiseless Channel}
\cite{zouhar-etal-2023-tokenization}, R\'enyi efficiency is suggested as an
intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer
which leads to the highest R\'enyi efficiency of the unigram distribution
should be chosen. The R\'enyi efficiency is thus treated as a predictor of
downstream performance (e.g., predicting BLEU for a machine translation task),
without the expensive step of training multiple models with different
tokenizers. Although useful, the predictive power of this metric is not
perfect, and the authors note there are additional qualities of a good
tokenization scheme that R\'enyi efficiency alone cannot capture.
We describe two variants of BPE tokenization which can arbitrarily increase
R\'enyi efficiency while decreasing the downstream model performance. These
counterexamples expose cases where R\'enyi efficiency fails as an intrinsic
tokenization metric and thus give insight for building more accurate
predictors.

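By analyzing two variants of BPE tokenization, this study reveals the limitations of using R\'enyi efficiency as a tokenization metric and offers insight for building more accurate predictors.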

Two Counterexamples to \textit{Tokenization and the Noiseless Channel}

The latest generative large language models (LLMs) have found their
application in data augmentation tasks, where small numbers of text samples are
LLM-paraphrased and then used to fine-tune the model. However, more research is
needed to assess how different prompts, seed data selection strategies,
filtering methods, or model settings affect the quality of paraphrased data
(and downstream models). In this study, we investigate three text diversity
incentive methods well established in crowdsourcing: taboo words, hints by
previous outlier solutions, and chaining on previous outlier solutions. Using
these incentive methods as part of instructions to LLMs augmenting text
datasets, we measure their effects on generated texts' lexical diversity and
downstream model performance. We compare the effects over 5 different LLMs and
6 datasets. We show that diversity is most increased by taboo words, while
downstream model performance is highest when previously created paraphrases are
used as hints.

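The latest generative large language models (LLMs) are being applied to data augmentation tasks, in which a small number of text samples are paraphrased by an LLM and then used to fine-tune a model. This study investigates three text diversity incentive methods widely used in crowdsourcing: taboo words, hints by previous outlier solutions, and chaining on previous outlier solutions. Using these methods as part of the instructions given to LLMs that augment text datasets, it measures their effects on the lexical diversity of the generated texts and on downstream model performance. The effects are compared over 5 different LLMs and 6 datasets. The results show that taboo words increase diversity the most, while downstream model performance is best when previously created paraphrases are used as hints.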

Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation

Subword tokenization is a key part of many NLP pipelines. However, little is
known about why some tokenizer and hyperparameter combinations lead to better
downstream model performance than others. We propose that good tokenizers lead
to \emph{efficient} channel usage, where the channel is the means by which some
input is conveyed to the model and efficiency can be quantified in
information-theoretic terms as the ratio of the Shannon entropy to the maximum
possible entropy of the token distribution. Yet, an optimal encoding according
to Shannon entropy assigns extremely long codes to low-frequency tokens and
very short codes to high-frequency tokens. Defining efficiency in terms of
R\'enyi entropy, on the other hand, penalizes distributions with either very
high or very low-frequency tokens. In machine translation, we find that across
multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong
correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for
compressed length.
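
As a hedged illustration of the quantities named above, the two notions of efficiency can be written out under the standard textbook definitions. The notation is an assumption made here, not taken from the abstract: $p$ is the unigram token distribution, $\mathcal{V}$ is the vocabulary, and the ratio form is a simplification that may differ in detail from the paper's formal definition.
\[
H(p) = -\sum_{w \in \mathcal{V}} p(w) \log p(w),
\qquad
H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_{w \in \mathcal{V}} p(w)^{\alpha}
\quad (\alpha > 0,\ \alpha \neq 1),
\]
\[
\mathrm{Eff}(p) = \frac{H(p)}{\log |\mathcal{V}|},
\qquad
\mathrm{Eff}_\alpha(p) = \frac{H_\alpha(p)}{\log |\mathcal{V}|}.
\]
Both ratios lie in $[0, 1]$ and equal $1$ exactly when $p$ is uniform, and $H_\alpha(p) \to H(p)$ as $\alpha \to 1$. For a toy distribution $p = (1/2, 1/4, 1/4)$ with $|\mathcal{V}| = 3$ and base-2 logarithms, $H(p) = 1.5$ and $\log_2 3 \approx 1.585$, so $\mathrm{Eff}(p) \approx 0.946$. Taking $\alpha = 2$ for easy arithmetic (the abstract reports $\alpha = 2.5$), $\sum_w p(w)^2 = 3/8$, so $H_2(p) = \log_2 (8/3) \approx 1.415$ and $\mathrm{Eff}_2(p) \approx 0.893$. With $\alpha > 1$, the high-frequency token lowers the score more than it does under Shannon entropy.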

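This paper studies subword tokenization in NLP and finds that tokenizer efficiency defined with R\'enyi entropy, rather than Shannon entropy, correlates strongly with machine translation quality.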