Current language models often fail to incorporate long contexts efficiently
during generation. We show that attention priors, likely learned during
pre-training, are a major contributor to this issue: relevant information
located earlier in context is attended to less on average. Yet even when models
fail to use the information from a relevant document in their response, they
still pay preferential attention to that document compared to an irrelevant
document at the same position. We leverage this fact to introduce ``attention
sorting'': perform one step of decoding, sort documents by the attention they
receive (highest attention going last), repeat the process, and then generate
the answer with the newly sorted context. We find that attention sorting
improves the performance of long-context models. Our findings highlight some
challenges in using off-the-shelf language models for retrieval-augmented
generation.
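
The sorting loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `attention_fn` is a hypothetical hook that stands in for one decoding step of a real model and returns one attention score per document, and `toy_attention` is an invented stand-in that multiplies a relevance weight by a position-dependent recency prior.

```python
from typing import Callable, List, Sequence


def attention_sort(
    documents: Sequence[str],
    attention_fn: Callable[[List[str]], List[float]],
    rounds: int = 1,
) -> List[str]:
    """Reorder documents so the most-attended one ends up last.

    attention_fn is a hypothetical hook: given the documents in their
    current order, it runs one decoding step and returns one attention
    score per document.
    """
    ordered = list(documents)
    for _ in range(rounds):
        scores = attention_fn(ordered)
        if len(scores) != len(ordered):
            raise ValueError("attention_fn must return one score per document")
        # Ascending stable sort by score: highest attention goes last.
        ranked = sorted(zip(scores, range(len(ordered))), key=lambda p: p[0])
        ordered = [ordered[i] for _, i in ranked]
    return ordered


def toy_attention(docs: List[str]) -> List[float]:
    """Invented scorer: relevance weight times a recency prior by position."""
    relevance = {"relevant": 3.0}
    n = len(docs)
    return [relevance.get(d, 1.0) * (0.5 + (i + 1) / n) for i, d in enumerate(docs)]


docs = ["relevant", "noise_a", "noise_b", "noise_c"]
sorted_docs = attention_sort(docs, toy_attention, rounds=2)
print(sorted_docs)  # ['noise_a', 'noise_b', 'noise_c', 'relevant']
```

In the toy run, the relevant document starts in the least favorable (earliest) position but still receives the highest score, so sorting moves it to the end of the context, where the recency prior works in its favor.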

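Current language models often fail to integrate long contexts efficiently during generation. A major cause appears to be attention priors that are likely learned during pre-training: relevant information that appears earlier in the context receives less attention on average. Even when a model fails to use a relevant document in its answer, it still pays more attention to that document than to an irrelevant document at the same position. Building on this, "attention sorting" improves the performance of long-context models: perform one decoding step, sort the documents by the attention they receive (highest attention last), repeat the process, and generate the answer from the newly sorted context. The findings highlight some challenges in using off-the-shelf language models for retrieval-augmented generation.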

Attention sorting counters recency bias in long-context language models

Attention Sorting Combats Recency Bias In Long Context Language Models

We present Position Interpolation (PI), which extends the context window sizes
of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal
fine-tuning (within 1000 steps), while demonstrating strong empirical results
on various tasks that require long context, including passkey retrieval,
language modeling, and long document summarization from LLaMA 7B to 65B.
Meanwhile, models extended by Position Interpolation preserve quality
relatively well on tasks within their original context window. To achieve this
goal, Position Interpolation linearly down-scales the input position indices to
match the original context window size, rather than extrapolating beyond the
trained context length, which may lead to catastrophically high attention scores
that completely ruin the self-attention mechanism. Our theoretical study shows
that the upper bound of interpolation is at least $\sim 600 \times$ smaller
than that of extrapolation, further demonstrating its stability. Models
extended via Position Interpolation retain their original architecture and can
reuse most pre-existing optimization and infrastructure.
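
The linear down-scaling of position indices can be illustrated with a short sketch. It assumes the standard RoPE angle formula (position times `base ** (-2j / d)`) and an original context window of 2048 tokens; the function names `rope_angles` and `interpolated_position` are hypothetical and chosen only for this example.

```python
from typing import List


def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angles for one position under the standard RoPE formula."""
    return [position * base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]


def interpolated_position(position: int, original_len: int, extended_len: int) -> float:
    """Linearly rescale a position index into the trained range [0, original_len)."""
    if extended_len < original_len:
        raise ValueError("extended_len must be at least original_len")
    return position * original_len / extended_len


# Assumed sizes: a 2048-token trained window extended to 32768 tokens.
original_len, extended_len = 2048, 32768
m = extended_len - 1  # last position in the extended window

scaled = interpolated_position(m, original_len, extended_len)
print(scaled)  # 2047.9375, inside the trained range

# Extrapolation would feed the raw index to RoPE; interpolation feeds the scaled one.
print(rope_angles(m, head_dim=8)[0], rope_angles(scaled, head_dim=8)[0])
```

Because every rescaled index stays below the original window size, the model only sees rotation angles in the range it was trained on, at the cost of a finer spacing between neighboring positions.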

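This paper introduces Position Interpolation, a method that extends the context window of RoPE-based pretrained LLMs to up to 32768 with only minimal fine-tuning, while showing strong empirical results on a range of tasks that require long context, including passkey retrieval, language modeling, and long document summarization.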

Extending the context window of large language models via position interpolation

Extending Context Window of Large Language Models via Positional Interpolation

Transformers-based models, such as BERT, have been among the most successful
deep learning models for NLP. Unfortunately, one of their core limitations is
the quadratic dependency (mainly in terms of memory) on the sequence length due
to their full attention mechanism. To remedy this, we propose BigBird, a
sparse attention mechanism that reduces this quadratic dependency to linear. We
show that BigBird is a universal approximator of sequence functions and is
Turing complete, thereby preserving these properties of the quadratic, full
attention model. Along the way, our theoretical analysis reveals some of the
benefits of having $O(1)$ global tokens (such as CLS) that attend to the
entire sequence as part of the sparse attention mechanism. The proposed sparse
attention can handle sequences of length up to 8x what was previously
possible using similar hardware. As a consequence of the capability to handle
longer context, BigBird drastically improves performance on various NLP tasks
such as question answering and summarization. We also propose novel
applications to genomics data.
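
A rough sketch of why such a sparse pattern scales linearly is given below. It assumes the pattern combines a constant number of global tokens with a small sliding window and a few random keys per query; the abstract itself only names the global tokens, and the function names and default sizes here are illustrative choices, not values from the paper.

```python
import random
from typing import List, Set


def sparse_attention_pattern(
    seq_len: int,
    num_global: int = 2,
    window: int = 3,
    num_random: int = 2,
    seed: int = 0,
) -> List[Set[int]]:
    """Return, for each query position, the set of key positions it attends to."""
    rng = random.Random(seed)
    global_ids = set(range(min(num_global, seq_len)))
    half = window // 2
    pattern: List[Set[int]] = []
    for i in range(seq_len):
        if i in global_ids:
            # Global tokens attend to the entire sequence.
            pattern.append(set(range(seq_len)))
            continue
        keys = set(global_ids)  # every token attends to the global tokens
        keys.update(range(max(0, i - half), min(seq_len, i + half + 1)))  # local window
        keys.update(rng.sample(range(seq_len), min(num_random, seq_len)))  # random keys
        pattern.append(keys)
    return pattern


def count_pairs(pattern: List[Set[int]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(len(keys) for keys in pattern)


for n in (64, 128, 256):
    # Sparse pair count next to the n * n pairs of full attention.
    print(n, count_pairs(sparse_attention_pattern(n)), n * n)
```

With these defaults each non-global query touches at most 7 keys, so the total number of pairs is bounded by a constant times the sequence length, while full attention grows quadratically.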

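This paper describes a limitation of Transformer-based models such as BERT and proposes BigBird, a new model whose sparse attention mechanism reduces the quadratic dependency (mainly in memory) caused by full attention to linear, allowing it to handle sequences up to 8x longer than before. Because it can handle longer contexts, BigBird achieves large performance gains on a variety of NLP tasks.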