Causal transformer language models (LMs), such as GPT-3, typically require
some form of positional encoding, such as positional embeddings. However, we
show that LMs without any explicit positional encoding are still competitive
with standard models, and that this phenomenon is robust across different
datasets, model sizes, and sequence lengths. Probing experiments reveal that
such models acquire an implicit notion of absolute positions throughout the
network, effectively compensating for the missing information. We conjecture
that causal attention enables the model to infer the number of predecessors
that each token can attend to, thereby approximating its absolute position. Our
findings indicate that causal LMs might derive positional awareness not only
from the explicit positioning mechanism, but also from the effects of the
causal mask.
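
The conjecture above can be illustrated with a toy calculation. The sketch below is not the paper's experimental setup. It assumes a single attention head with equal scores for every visible token and a hypothetical value vector that is 1 for the first token and 0 elsewhere. Under these assumptions, the causal mask makes position t average over exactly t tokens, so the output is 1/t and the absolute position can be read back from it. The function names are illustrative only.

```python
import math


def causal_uniform_attention(values):
    """Causal self-attention with equal scores (toy assumption).

    Position i may only attend to values[0..i], so its output is the
    mean of those i + 1 values.
    """
    outputs = []
    for i in range(len(values)):
        scores = [0.0] * (i + 1)              # equal scores for all visible tokens
        exps = [math.exp(s) for s in scores]  # softmax numerator
        total = sum(exps)
        weights = [e / total for e in exps]   # each weight equals 1 / (i + 1)
        outputs.append(sum(w * v for w, v in zip(weights, values[: i + 1])))
    return outputs


def recover_positions(seq_len):
    """Recover 1-indexed absolute positions from the attention output alone."""
    # Hypothetical value vector: 1 for the first token, 0 for all others.
    values = [1.0] + [0.0] * (seq_len - 1)
    attended = causal_uniform_attention(values)
    # The output at position t is 1 / t, so inverting it yields t.
    return [round(1.0 / a) for a in attended]


positions = recover_positions(8)
```

No positional encoding enters the computation. The position signal comes only from how many predecessors the mask exposes at each step.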
