Retrieval-enhanced language models (LMs), which condition their predictions
on text retrieved from large external datastores, have recently shown
significant perplexity improvements compared to standard LMs. One such
approach, the $k$NN-LM, interpolates any existing LM's predictions with the
output of a $k$-nearest neighbors model and requires no additional training. In
this paper, we explore the importance of lexical and semantic matching in the
context of items retrieved by $k$NN-LM. We find two trends: (1) the presence of
large overlapping $n$-grams between the datastore and evaluation set is an
important factor in strong performance, even when the datastore is derived from
the training data; and (2) the $k$NN-LM is most beneficial when retrieved items
have high semantic similarity with the query. Based on our analysis, we define
a new formulation of the $k$NN-LM that uses retrieval quality to assign the
interpolation coefficient. We empirically measure the effectiveness of our
approach on two English language modeling datasets, Wikitext-103 and PG-19. Our
re-formulation of the $k$NN-LM is beneficial in both cases, and leads to nearly
4% improvement in perplexity on the Wikitext-103 test set.
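
The interpolation described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the softmax over negative distances, the function names, and in particular the threshold rule in `adaptive_lambda` (which stands in for "using retrieval quality to assign the interpolation coefficient") are assumptions made for the example.

```python
import math

def knn_distribution(neighbors, temperature=1.0):
    """Turn (next_token, distance) pairs into a distribution over tokens.

    Assumption: weights are a softmax over negative distances, and the
    weights of neighbors that share a next token are summed.
    """
    weights = {}
    for token, distance in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}

def interpolate(p_lm, p_knn, lam):
    """Return lam * p_knn + (1 - lam) * p_lm over the union of tokens."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0) for t in vocab}

def adaptive_lambda(neighbors, thresholds=(1.0, 4.0), lambdas=(0.6, 0.3, 0.1)):
    """Hypothetical rule: trust retrieval more when the closest neighbor
    is nearer to the query (a smaller distance means a better match)."""
    best = min(distance for _, distance in neighbors)
    for threshold, lam in zip(thresholds, lambdas):
        if best <= threshold:
            return lam
    return lambdas[-1]  # poor retrieval: rely mostly on the base LM

# Toy example with made-up probabilities and distances.
p_lm = {"cat": 0.5, "dog": 0.3, "car": 0.2}
neighbors = [("dog", 0.5), ("dog", 0.7), ("cat", 3.0)]
lam = adaptive_lambda(neighbors)
p = interpolate(p_lm, knn_distribution(neighbors), lam)
```

With a fixed coefficient, every query would weight retrieval the same way; making `lam` depend on the neighbors lets poor retrievals fall back to the base LM.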

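This work studies how lexical and semantic matching of the retrieved text affects the performance of the retrieval-enhanced $k$NN-LM. A new formulation that sets the interpolation coefficient from retrieval quality improves perplexity on the English language modeling datasets Wikitext-103 and PG-19, by nearly 4% on the Wikitext-103 test set.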

You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM

Class-based language models (LMs) have long been used to address context
sparsity in $n$-gram LMs. In this study, we revisit this approach in the
context of neural LMs. We hypothesize that class-based prediction leads to an
implicit context aggregation for similar words and thus can improve
generalization for rare words. We map words that have a common WordNet hypernym
to the same class and train large neural LMs by gradually annealing from
predicting the class to token prediction during training. Empirically, this
curriculum learning strategy consistently improves perplexity over various
large, state-of-the-art Transformer-based models on two datasets,
WikiText-103 and Arxiv. Our analysis shows that the performance
improvement is achieved without sacrificing performance on rare words. Finally,
we document other attempts that failed to yield empirical gains, and discuss
future directions for the adoption of class-based LMs on a larger scale.
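
The class-to-token curriculum described in this abstract can be sketched as below. The abstract only says that training is gradually annealed from class prediction to token prediction; the linear schedule, the per-token sampling, and the tiny `HYPERNYM_CLASS` table (a stand-in for a WordNet lookup) are assumptions made for illustration.

```python
import random

# Hypothetical word -> hypernym class table; in the paper the classes
# are derived from WordNet hypernyms.
HYPERNYM_CLASS = {
    "apple": "<fruit>", "pear": "<fruit>",
    "dog": "<canine>", "wolf": "<canine>",
}

def class_probability(step, anneal_steps):
    """Probability of using the class as target: linear decay from 1 to 0."""
    return max(0.0, 1.0 - step / anneal_steps)

def training_targets(tokens, step, anneal_steps, rng):
    """Replace tokens by their hypernym class with a probability that
    shrinks as training progresses; unmapped tokens are kept as-is."""
    prob = class_probability(step, anneal_steps)
    targets = []
    for token in tokens:
        cls = HYPERNYM_CLASS.get(token)
        if cls is not None and rng.random() < prob:
            targets.append(cls)
        else:
            targets.append(token)
    return targets

tokens = ["the", "apple", "dog"]
early = training_targets(tokens, step=0, anneal_steps=100, rng=random.Random(0))
late = training_targets(tokens, step=100, anneal_steps=100, rng=random.Random(0))
```

At step 0 every mapped word is predicted as its class, so rare words such as "pear" share training signal with "apple"; by the end of the annealing window the targets are the original tokens.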

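By mapping words that share a WordNet hypernym to the same class and gradually annealing training from class prediction to token prediction, the authors show on two datasets that this curriculum learning strategy consistently improves perplexity without sacrificing performance on rare words.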

Better Language Model with Hypernym Class Prediction

We consider language modelling (LM) as a multi-label structured prediction
task by re-framing training from solely predicting a single ground-truth word
to ranking a set of words which could continue a given context. To avoid
annotating top-$k$ ranks, we generate them using pre-trained LMs: GPT-2, BERT,
and Born-Again models. This leads to a rank-based form of knowledge
distillation (KD). We also develop a method using $N$-grams to create a
non-probabilistic teacher which generates the ranks without the need for a
pre-trained LM.
We confirm the hypotheses that we can treat language modelling as a ranking
task and that we can do so without the use of a pre-trained LM. We show that
rank-based KD generally improves perplexity (PPL), often with statistical
significance, when compared to Kullback-Leibler-based (KL-based) KD.
Surprisingly, given the simplicity of the method, $N$-grams act as competitive
teachers and achieve performance similar to that of BERT or Born-Again model
teachers. GPT-2 always acts as the best teacher, though: with GPT-2 as teacher
and a Transformer-XL student on Wiki-02, rank-based KD reduces perplexity from
65.27 for a cross-entropy baseline to 55.94, compared with 56.70 for KL-based
KD.
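
One way an $N$-gram teacher could produce rank targets without a pre-trained LM is sketched below. This is an assumption-laden illustration rather than the paper's method: it uses bigrams only, places the ground-truth word first, and orders the remaining observed continuations by count with alphabetical tie-breaking. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def build_bigram_teacher(corpus):
    """Count which words follow each context word in a tokenized corpus."""
    continuations = defaultdict(Counter)
    for sentence in corpus:
        for prev, nxt in zip(sentence, sentence[1:]):
            continuations[prev][nxt] += 1
    return continuations

def top_k_ranks(teacher, context_word, gold, k=3):
    """Return up to k ranked target words for one context.

    Assumption: the ground-truth word is ranked first, followed by other
    continuations seen after the same context, most frequent first.
    """
    counts = teacher.get(context_word, Counter())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    others = [word for word, _ in ranked if word != gold]
    return ([gold] + others)[:k]

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "slept"],
]
teacher = build_bigram_teacher(corpus)
ranks_after_the = top_k_ranks(teacher, "the", gold="dog", k=2)
ranks_after_cat = top_k_ranks(teacher, "cat", gold="sat", k=3)
```

A student trained on such lists sees several plausible continuations per context instead of a single one-hot target, which is the multi-label view the abstract describes.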

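This work frames language modelling as a ranking task. It avoids annotating ranks by generating them with pre-trained GPT-2, BERT, and Born-Again models, and it also builds a non-probabilistic teacher from $N$-grams. The results confirm that language modelling can be treated as a ranking task without a pre-trained LM, and that rank-based KD generally improves perplexity over KL-based KD, often with statistical significance.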

Language Modelling via Learning to Rank

We introduce adaptive input representations for neural language modeling
which extend the adaptive softmax of Grave et al. (2017) to input
representations of variable capacity. There are several choices on how to
factorize the input and output layers, and whether to model words, characters
or sub-word units. We perform a systematic comparison of popular choices for a
self-attentional architecture. Our experiments show that models equipped with
adaptive embeddings are more than twice as fast to train as the popular
character input CNN while having a lower number of parameters. On the
WikiText-103 benchmark, we achieve 18.7 perplexity, an improvement of 10.5
perplexity compared to the previously best published result, and on the Billion
Word benchmark, we achieve 23.02 perplexity.
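
To make "input representations of variable capacity" concrete, the sketch below counts embedding parameters when the vocabulary is split into frequency bands and rarer bands get smaller embeddings that are projected up to the model dimension. The cutoffs, the reduction factor, the per-band projection, and the function name are assumptions chosen for illustration, not the configuration reported in the paper.

```python
def adaptive_embedding_params(cutoffs, vocab_size, model_dim, factor):
    """Count parameters for frequency-banded input embeddings.

    Band i covers frequency ranks [bounds[i], bounds[i + 1]) and uses
    dimension model_dim // factor**i, plus a linear projection that
    maps the band's embeddings up to model_dim.
    """
    bounds = [0] + list(cutoffs) + [vocab_size]
    total = 0
    for i in range(len(bounds) - 1):
        band_size = bounds[i + 1] - bounds[i]
        band_dim = model_dim // factor ** i
        total += band_size * band_dim   # embedding table for this band
        total += band_dim * model_dim   # projection to the model dimension
    return total

# Hypothetical setup: 100k-word vocabulary, three bands, 4x reduction per band.
vocab_size, model_dim = 100_000, 512
adaptive = adaptive_embedding_params([10_000, 40_000], vocab_size, model_dim, factor=4)
fixed = vocab_size * model_dim  # one full-width embedding per word
```

In this toy setup the banded embeddings need about 11.2M parameters against 51.2M for a fixed-width table, because the many infrequent words are stored at a fraction of the full dimension.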

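The paper introduces adaptive input representations for neural language modeling. It compares modeling at the character, word, and sub-word level, along with factorization schemes for the input and output layers, in a self-attentional architecture, and it achieves faster training and lower perplexity without increasing the number of parameters.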