Current language models often fail to incorporate long contexts efficiently
during generation. We show that attention priors, likely learned during
pre-training, are a major contributor to this issue: relevant information
located earlier in the context is attended to less on average. Yet even when models
fail to use the information from a relevant document in their response, they
still pay preferential attention to that document compared to an irrelevant
document at the same position. We leverage this fact to introduce ``attention
sorting'': perform one step of decoding, sort documents by the attention they
receive (highest attention going last), repeat the process, and then generate the
answer with the newly sorted context. We find that attention sorting improves the
performance of long-context models. Our findings highlight some challenges in
using off-the-shelf language models for retrieval-augmented generation.
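
The reordering loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: `attention_fn` is a hypothetical callable standing in for one decoding step that returns a single attention score per document (how attention is aggregated over layers, heads, and tokens is left open here), and `toy_attention` is a made-up scorer that mimics a recency bias plus a bonus for the relevant document.

```python
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption of this sketch): given the documents
# in their current order, run one decoding step and return one attention
# score per document, in the same order.
AttentionFn = Callable[[Sequence[str]], Sequence[float]]


def attention_sort(
    documents: Sequence[str],
    attention_fn: AttentionFn,
    num_rounds: int = 1,
) -> List[str]:
    """Reorder documents so that the most-attended document comes last."""
    docs = list(documents)
    for _ in range(num_rounds):
        scores = attention_fn(docs)
        if len(scores) != len(docs):
            raise ValueError("attention_fn must return one score per document")
        # Ascending sort: lowest attention first, highest attention last.
        # sorted() is stable, so ties keep their current relative order.
        order = sorted(range(len(docs)), key=lambda i: scores[i])
        docs = [docs[i] for i in order]
    return docs


def toy_attention(docs: Sequence[str]) -> List[float]:
    """Toy stand-in for a model: later positions get more attention,
    and a document containing the word 'relevant' gets a fixed bonus."""
    return [
        0.1 * position + (1.0 if "relevant" in doc else 0.0)
        for position, doc in enumerate(docs)
    ]


if __name__ == "__main__":
    context = [
        "relevant: the answer is here",
        "noise a",
        "noise b",
        "noise c",
    ]
    for doc in attention_sort(context, toy_attention, num_rounds=2):
        print(doc)
```

In this toy setting the relevant document starts at the earliest position, where the positional prior penalizes it most, yet it still receives more attention than the irrelevant documents and is therefore moved to the end of the context. The final answer would then be generated from the reordered context.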
