Length extrapolation is a desirable property that permits training a
transformer language model on short sequences and retaining similar
perplexities when the model is tested on substantially longer sequences. ALiBi,
a relative positional embedding mechanism applied to the transformer
self-attention matrix, is the most widely used method to date that demonstrates
length extrapolation. In this paper, we show that ALiBi surprisingly
does not utilize tokens farther away than the training sequence length, which can be
explained by its implicit windowed attention effect that aligns the receptive
field during training and testing stages. Inspired by ALiBi and the receptive
field alignment hypothesis, we propose another transformer positional embedding
design named~\textbf{Sandwich} that uses information from beyond the training
sequence length; it is a greatly simplified formulation of the earliest
proposed Sinusoidal positional embedding. Finally, we show that both ALiBi and
Sandwich enable efficient inference thanks to their implicit windowed attention
effect.
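
The implicit windowed attention effect can be illustrated with a minimal sketch. It assumes ALiBi's published formulation (a per-head linear penalty proportional to the query-key distance, with geometrically spaced head slopes) and, as a simplification that is not part of the abstract, equal content scores for every key, so that only the bias shapes the attention weights. All function names are hypothetical.

```python
import math

def alibi_slopes(num_heads):
    """Head slopes 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(query_pos, slope):
    """Bias added to the logits of keys 0..query_pos for one causal query."""
    return [-slope * (query_pos - j) for j in range(query_pos + 1)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mass_within_window(seq_len, slope, window):
    """Attention mass the last query places on its `window` nearest keys.

    Assumption: all content scores (q . k) are equal, so the weights
    depend only on the ALiBi bias.
    """
    weights = softmax(alibi_bias(seq_len - 1, slope))
    first = max(0, len(weights) - window)
    return sum(weights[first:])

# Compare a "training" length with a longer "testing" length for one head.
slope = alibi_slopes(8)[0]
short = mass_within_window(512, slope, 32)
longer = mass_within_window(2048, slope, 32)
print(short, longer)
```

Under these assumptions, a head with a steep slope places almost all of its attention mass on a fixed number of nearby keys regardless of the total sequence length, which is the sense in which the receptive field stays aligned between training and testing. Heads with shallow slopes decay more slowly, so their effective window is wider.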

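This work studies relative positional embeddings in Transformer language models and, building on the receptive field alignment hypothesis, proposes an attention design that keeps the receptive field aligned between training and testing. The proposed Sandwich positional embedding incorporates information from beyond the training sequence length, and its implicit windowed attention enables efficient inference.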

Receptive Field Alignment Enables Transformer Length Extrapolation

Deep learning (DL) techniques involving fine-tuning large numbers of model
parameters have delivered impressive performance on the task of discriminating
between language produced by cognitively healthy individuals and by those with
Alzheimer's disease (AD). However, questions remain about their ability to
generalize beyond the small reference sets that are publicly available for
research. As an alternative to fitting model parameters directly, we propose a
novel method by which a Transformer DL model (GPT-2) pre-trained on general
English text is paired with an artificially degraded version of itself (GPT-D),
to compute the ratio between these two models' \textit{perplexities} on
language from cognitively healthy and impaired individuals. This technique
approaches state-of-the-art performance on text data from a widely used "Cookie
Theft" picture description task, and unlike established alternatives also
generalizes well to spontaneous conversations. Furthermore, GPT-D generates
text with characteristics known to be associated with AD, demonstrating the
induction of dementia-related linguistic anomalies. Our study is a step toward
better understanding of the relationships between the inner workings of
generative neural language models, the language that they produce, and the
deleterious effects of dementia on human speech and language characteristics.
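
The paired-perplexity idea can be sketched in a few lines. This is only an illustration under stated assumptions: the per-token natural-log probabilities are taken as given (in practice they would come from scoring the same transcript with the intact and the degraded model), the direction of the ratio is an assumption because the abstract does not specify it, and the function names are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def perplexity_ratio(intact_log_probs, degraded_log_probs):
    """Ratio of intact-model to degraded-model perplexity on the same text.

    Assumption: intact over degraded; the opposite convention works
    equally well as long as it is applied consistently.
    """
    return perplexity(intact_log_probs) / perplexity(degraded_log_probs)

# Made-up log-probabilities for a three-token transcript (illustrative only).
intact = [math.log(0.5)] * 3
degraded = [math.log(0.25)] * 3
print(perplexity_ratio(intact, degraded))
```

A single scalar per transcript is then available as a feature for separating the two groups, without fitting model parameters to the reference set.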

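This study proposes a novel method that uses the ratio between the perplexities of a Transformer DL model (GPT-2) and an artificially degraded version of itself (GPT-D) on language from cognitively healthy and impaired individuals, and it approaches state-of-the-art performance. It also shows that GPT-D can generate text with linguistic anomalies associated with AD, a step toward better understanding the relationships between the inner workings of generative neural language models, the language they produce, and the deleterious effects of dementia on human speech and language characteristics.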

GPT-D: Inducing Dementia-related Linguistic Anomalies by Deliberate Degradation of Artificial Neural Language Models

The Transformer architecture is superior to RNN-based models in computational
efficiency. Recently, GPT and BERT have demonstrated the efficacy of Transformer
models on various NLP tasks using language models pre-trained on large-scale
corpora. Surprisingly, these Transformer architectures are suboptimal for
language modeling itself. Neither self-attention nor the positional encoding in
the Transformer is able to efficiently incorporate the word-level sequential
context crucial to language modeling.
In this paper, we explore effective Transformer architectures for language
modeling, including adding LSTM layers to better capture the sequential
context while still keeping the computation efficient. We propose Coordinate
Architecture Search (CAS) to find an effective architecture through iterative
refinement of the model. Experimental results on the PTB, WikiText-2, and
WikiText-103 show that CAS achieves perplexities between 20.42 and 34.11 on all
problems, i.e., on average an improvement of 12.0 perplexity units compared to
state-of-the-art LSTMs. The source code is publicly available.

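This paper addresses the Transformer architecture's inability to efficiently incorporate the word-level sequential context that language modeling requires. It explores effective Transformer architectures that add LSTM layers to better capture sequential context while keeping computation efficient, and proposes Coordinate Architecture Search (CAS), which finds an effective architecture through iterative refinement of the model. Experimental results show that CAS achieves perplexities between 20.42 and 34.11 on all problems, an average improvement of 12.0 perplexity units over state-of-the-art LSTMs.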