Position encoding in the transformer architecture provides supervision for dependency modeling between elements at different positions in the sequence. We investigate various methods to encode positional information in transformer-based language models and propose a novel implementation named Rotary Position Embedding (RoPE). The proposed RoPE encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility of being expanded to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding. As a result, the enhanced transformer with rotary position embedding, or RoFormer, achieves superior performance in tasks with long texts. We release the theoretical analysis along with some preliminary experimental results on Chinese data. The ongoing experiments on English benchmarks will be updated soon.

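This paper studies how to integrate positional information into language models and proposes a method named RoPE, which encodes positional information as a rotation matrix while also incorporating explicit relative position dependency into the self-attention formulation. Experimental results show that RoPE enables the transformer to achieve superior performance on long-text classification problems.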
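
The rotation-based encoding described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `rope` and `dot`, the sample vectors, and the frequency base of 10000 are assumptions chosen for illustration. Each consecutive pair of coordinates is rotated by an angle proportional to the token's absolute position, and the dot product of two rotated vectors then depends only on the difference between their positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive coordinate pair of `vec` by an angle
    proportional to the absolute position `pos` (illustrative sketch)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        # Assumed frequency schedule: pair k = i / 2 uses base ** (-2k / d).
        theta = base ** (-i / d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation of the pair (x, y) by `angle`.
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because rotations preserve inner products up to the relative angle, shifting both positions by the same amount leaves the attention score unchanged, which is the sense in which an absolute encoding yields a relative dependency.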

RoFormer: Enhanced Transformer with Rotary Position Embedding