Character-level language models obviate the need for separately trained
tokenizers, but efficiency suffers from longer sequence lengths. Learning to
combine character representations into tokens has made training these models
more efficient, but they still require decoding characters individually. We
propose Toucan, an augmentation to character-level models to make them
"token-aware". Comparing our method to prior work, we demonstrate significant
speed-ups in character generation without a loss in language modeling
performance. We then compare our learned dynamic tokenization of character
sequences with popular fixed-vocabulary solutions such as Byte-Pair Encoding
and WordPiece, finding that our approach tokenizes a greater number of longer
sequences as single items. Our project and
code are available at this https URL
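
For readers unfamiliar with the fixed-vocabulary baseline named above, the following is a minimal sketch of how Byte-Pair Encoding learns its merges on a toy corpus. It illustrates the comparison point only, not Toucan's learned dynamic tokenization; the function name `learn_bpe_merges` and the toy word list are assumptions made for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Hypothetical helper for illustration; real tokenizers add end-of-word
    markers, byte fallbacks and other details omitted here.
    """
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Ties are broken by first occurrence (dict insertion order).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(corpus))
```

Because the merge list is fixed once training ends, every later text is segmented with the same vocabulary; this is the sense in which such tokenizers are "fixed" in contrast to a tokenization learned jointly with the model.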

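Toucan is an augmentation to character-level models that learns to combine character representations into tokens, making the models more "token-aware". Compared with prior methods, our approach significantly speeds up character generation while preserving language modeling performance. We also explore the differences between the learned dynamic tokenization of character sequences and popular fixed-vocabulary solutions such as Byte-Pair Encoding and WordPiece, finding that our approach results in more long sequences being tokenized as single items.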

Toucan: Token-Based Character-Level Language Modeling

Toucan: Token-Aware Character Level Language Modeling

Multilingual language models are widely used to extend NLP systems to
low-resource languages. However, concrete evidence for the effects of
multilinguality on language modeling performance in individual languages
remains scarce. Here, we pre-train over 10,000 monolingual and multilingual
language models for over 250 languages, including multiple language families
that are under-studied in NLP. We assess how language modeling performance in
each language varies as a function of (1) monolingual dataset size, (2) added
multilingual dataset size, (3) linguistic similarity of the added languages,
and (4) model size (up to 45M parameters). We find that in moderation, adding
multilingual data improves low-resource language modeling performance, similar
to increasing low-resource dataset sizes by up to 33%. Improvements depend on
the syntactic similarity of the added multilingual data, with marginal
additional effects of vocabulary overlap. However, high-resource languages
consistently perform worse in multilingual pre-training scenarios. As dataset
sizes increase, adding multilingual data begins to hurt performance for both
low-resource and high-resource languages, likely due to limited model capacity
(the "curse of multilinguality"). These results suggest that massively
multilingual pre-training may not be optimal for any languages involved, but
that more targeted models can significantly improve performance.

Adding multilingual data can improve the performance of low-resource language models, but for high-resource languages it may reduce performance.

A Study of Multilingual Models Across More Than 200 High- and Low-Resource Languages

When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

Predicting upcoming events is critical to our ability to interact with our
environment. Transformer models, trained on next-word prediction, appear to
construct representations of linguistic input that can support diverse
downstream tasks. But how does a predictive objective shape such
representations? Inspired by recent work in vision (Henaff et al., 2019), we
test a hypothesis about predictive representations of autoregressive
transformers. In particular, we test whether the neural trajectory of a
sentence becomes progressively straighter as it passes through the network
layers. The key insight is that straighter trajectories should facilitate
prediction via linear extrapolation. We quantify straightness using a
1-dimensional curvature metric, and present four findings in support of the
trajectory straightening hypothesis: i) In trained models, the curvature
decreases from the early to the deeper layers of the network. ii) Models that
perform better on the next-word prediction objective exhibit greater decreases
in curvature, suggesting that this improved ability to straighten sentence
trajectories may be the driver of better language modeling performance. iii)
Given the same linguistic context, the sequences that are generated by the
model have lower curvature than the actual continuations observed in a language
corpus, suggesting that the model favors straighter trajectories for making
predictions. iv) A consistent relationship holds between the average curvature
and the average surprisal of sentences in the deep model layers, such that
sentences with straighter trajectories also have lower surprisal. Importantly,
untrained models do not exhibit these behaviors. In tandem, these results
support the trajectory straightening hypothesis and provide a possible
mechanism for how the geometry of the internal representations of
autoregressive models supports next word prediction.
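
As a rough illustration of the kind of curvature measurement described above, the sketch below treats a sentence as a sequence of per-token hidden-state vectors at one layer and averages the angle between consecutive displacement vectors. This is one common way to define trajectory curvature and is an assumption here, not necessarily the exact metric used in the paper; the function name `average_curvature` and the toy trajectories are hypothetical.

```python
import numpy as np


def average_curvature(states) -> float:
    """Mean angle (radians) between consecutive displacement vectors.

    `states` has shape (T, d): one d-dimensional vector per token.
    Assumes no two consecutive states are identical (non-zero steps).
    """
    states = np.asarray(states, dtype=float)
    if states.shape[0] < 3:
        raise ValueError("need at least three points to measure a turn")
    diffs = np.diff(states, axis=0)                        # (T-1, d) steps
    units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    cosines = np.sum(units[:-1] * units[1:], axis=1)       # (T-2,) cos of turn
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(angles.mean())


# Toy trajectories: a perfectly straight line and a right-angle zigzag.
straight = np.outer(np.arange(5), np.array([1.0, 2.0, 3.0]))
zigzag = np.array([[0, 0], [1, 1], [2, 0], [3, 1], [4, 0]], dtype=float)

print(average_curvature(straight))  # close to 0
print(average_curvature(zigzag))    # close to pi / 2
```

Applied layer by layer to a trained model's hidden states, a decreasing value of such a measure with depth would correspond to the straightening the abstract reports. A lower value also means the next state is closer to a linear extrapolation of the previous two.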

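The predictive representations of autoregressive transformers achieve better language modeling performance by becoming progressively straighter, and show a consistent relationship with sentence surprisal.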

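Large Language Models Implicitly Learn to Straighten Neural Sentence Trajectories into a Predictive Representation of Natural Language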

Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language

Neural language models are probabilistic models of human text. They are
predominantly trained using maximum likelihood estimation (MLE), which is
equivalent to minimizing the forward cross-entropy between the empirical data
distribution and the model distribution. However, various degeneration
phenomena are still widely observed when decoding from the distributions
learned by such models. We establish that the forward cross-entropy is
suboptimal as a distance metric for aligning the human and model distributions
due to its (1) recall-prioritization, (2) negative diversity ignorance, and (3)
train-test mismatch. In this paper, we propose Earth Mover Distance
Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on
the inherent properties of earth mover distance to address the aforementioned
challenges. Due to the high complexity of direct computation, we further
introduce a feasible upper bound for EMO to ease end-to-end training. Upon
extensive evaluation of language models trained using EMO and MLE, we find that
EMO demonstrates consistently better language modeling performance than MLE
across domains. Moreover, EMO demonstrates noteworthy enhancements in
downstream performance with minimal fine-tuning on merely 25,000 sentences.
This highlights the tremendous potential of EMO as a lightweight calibration
method for enhancing large-scale pre-trained language models.
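
To make the contrast between forward cross-entropy and earth mover distance concrete, here is a small sketch for a single prediction step with a one-hot target. It relies on a standard fact: when the target is a point mass, the only valid transport plan moves all model probability to the target token, so the distance reduces to the expected cost under the model. The three-token vocabulary and the cost matrix are made-up values for illustration, and the code does not reproduce the upper bound the paper introduces for training.

```python
import numpy as np


def forward_cross_entropy(p, q) -> float:
    """H(p, q) = -sum_i p_i * log(q_i), skipping tokens where p_i = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))


def emd_one_hot(target: int, q, cost) -> float:
    """Earth mover distance between a point mass on `target` and q.

    With a point-mass target, all of q's mass is transported to `target`,
    so the distance is the expected cost sum_j q_j * cost[target, j].
    """
    return float(np.dot(np.asarray(q, dtype=float), np.asarray(cost)[target]))


# Hypothetical 3-token vocabulary: token 1 is "close" to token 0, token 2 is "far".
cost = np.array([[0.0, 0.1, 1.0],
                 [0.1, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])
one_hot = np.array([1.0, 0.0, 0.0])     # ground-truth token is 0
q_near = np.array([0.5, 0.5, 0.0])      # leftover mass on the close token
q_far = np.array([0.5, 0.0, 0.5])       # leftover mass on the far token

print(forward_cross_entropy(one_hot, q_near), forward_cross_entropy(one_hot, q_far))
print(emd_one_hot(0, q_near, cost), emd_one_hot(0, q_far, cost))
```

Both model distributions receive the same cross-entropy because only the probability of the target token enters the loss, while the transport cost separates them. This is one way to read the "negative diversity ignorance" listed above, under the assumption that the cost reflects how close two tokens are.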

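Neural language models are probabilistic models of human text, trained predominantly with maximum likelihood estimation. This paper proposes an auto-regressive language modeling method based on earth mover distance (EMD) optimization, enables end-to-end training through an upper-bound estimate of the EMD, and shows better language modeling performance than MLE in extensive evaluation. In addition, EMO substantially improves downstream task performance after fine-tuning on only 25,000 sentences, showing great potential as a lightweight calibration method.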