Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.
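
To give a feel for what "limited precision" means here, the following is a minimal sketch that emulates BFloat16 rounding in pure Python and applies it to integer position indices. It is not the paper's analysis: the helper names `to_bf16` and `distinct_bf16_positions` are hypothetical, and whether a given training stack ever holds position indices or rotation angles in BFloat16 depends on the implementation. The only fact relied on is that BFloat16 keeps 8 significant bits, so integers above 256 are no longer all representable.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float.

    Emulation only (hypothetical helper): intended for finite values well
    inside the float32 range.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the top 16 bits of float32; round away the low 16 bits
    # to nearest, breaking ties toward an even retained mantissa.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def distinct_bf16_positions(start: int, stop: int) -> int:
    """Count how many distinct bfloat16 values the integers in [start, stop) map to."""
    return len({to_bf16(float(p)) for p in range(start, stop)})


if __name__ == "__main__":
    # Small positions survive exactly; larger ones collapse onto a coarse grid.
    for pos in (255, 256, 257, 4097):
        print(pos, "->", to_bf16(float(pos)))
    # 32 consecutive positions starting at 4096 map to far fewer distinct values.
    print(distinct_bf16_positions(4096, 4128))
```

In this emulation the grid spacing near position 4096 is 32, so neighbouring tokens become indistinguishable by position alone. That is one intuition for why precision loss grows with context length.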

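This work analyzes the numerical issues that arise when Rotary Positional Embedding (RoPE) is used with the BFloat16 format in long-context training. The paper proposes AnchorAttention, a new attention mechanism that treats the first token as a shared anchor, mitigating BFloat16's precision limits, improving long-context capability, and reducing training time by more than 50%. Experiments show that AnchorAttention significantly improves long-context performance while preserving the LLM's capabilities on general tasks.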
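
The shared-anchor idea can be sketched as an attention mask. The code below is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). It assumes token 0 is the anchor, that the documents are packed one after another behind it, that attention is causal within each document, and that no token attends across documents except to the anchor. The function name `anchor_attention_mask` and its `doc_lengths` parameter are hypothetical, and position-ID assignment is not modeled.

```python
from typing import List


def anchor_attention_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout (hypothetical): [anchor, doc_0 tokens..., doc_1 tokens..., ...].
    """
    total = 1 + sum(doc_lengths)
    # Document id per token; the anchor gets a sentinel id of -1.
    doc_id = [-1]
    for d, n in enumerate(doc_lengths):
        doc_id.extend([d] * n)

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only keys at or before the query
            # Visible if the key is the shared anchor or lies in the same document.
            if j == 0 or doc_id[i] == doc_id[j]:
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Two documents of two tokens each, packed behind one anchor token.
    for row in anchor_attention_mask([2, 2]):
        print(" ".join("1" if v else "0" for v in row))
    # 1 0 0 0 0
    # 1 1 0 0 0
    # 1 1 1 0 0
    # 1 0 0 1 0
    # 1 0 0 1 1
```

Compared with full causal attention over the packed sequence, the cross-document entries are masked out. This corresponds to the "unnecessary attention computations" the abstract says AnchorAttention avoids, while every document still sees the same first token.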

When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training