This work introduces an efficient method to scale Transformer-based Large
Language Models (LLMs) to infinitely long inputs with bounded memory and
computation. A key component in our proposed approach is a new attention
technique dubbed Infini-attention. The Infini-attention incorporates a
compressive memory into the vanilla attention mechanism and builds in both
masked local attention and long-term linear attention mechanisms in a single
Transformer block. We demonstrate the effectiveness of our approach on
long-context language modeling benchmarks, 1M sequence length passkey context
block retrieval and 500K length book summarization tasks with 1B and 8B LLMs.
Our approach introduces minimal bounded memory parameters and enables fast
streaming inference for LLMs.

该研究介绍了一种有效的方法，用于将基于 Transformer 的大型语言模型扩展到无限长的输入，同时保证有界的内存和计算。我们提出的方法的关键组成部分是一种称为 Infini-attention 的新的注意力技术，它将压缩性记忆融入到传统的注意力机制中，并在单个 Transformer 块中集成了被屏蔽的局部注意力和长期线性注意力机制。我们在长文本语言建模、1M 序列长度密钥上下文块检索和 500K 长度的书籍摘要任务上展示了我们方法的有效性，使用 1B 和 8B 规模的大型语言模型。我们的方法引入了最小化的有界内存参数，并实现了 LLMs 的快速流式推理。