Excessive memory requirements of key and value features (KV-cache) present
significant challenges in the autoregressive inference of large language models
(LLMs), restricting both the speed and length of text generation. Approaches
such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate
these challenges by grouping query heads and consequently reducing the number
of corresponding key and value heads. However, MQA and GQA decrease the
KV-cache size requirements at the expense of LLM accuracy (quality of text
generation). These methods do not ensure an optimal tradeoff between KV-cache
size and text generation quality due to the absence of quality-aware grouping
of query heads. To address this issue, we propose Quality and Capacity-Aware
Grouped Query Attention (QCQA), which identifies optimal query head groupings
using an evolutionary algorithm with a computationally inexpensive fitness
function. We demonstrate that QCQA achieves a significantly
better tradeoff between KV-cache capacity and LLM accuracy compared to GQA. For
the Llama2 $7\,$B model, QCQA achieves $\mathbf{20}\,$\% higher accuracy than GQA
with similar KV-cache size requirements in the absence of fine-tuning. After
fine-tuning both QCQA and GQA, for a similar KV-cache size, QCQA provides
$\mathbf{10.55}\,$\% higher accuracy than GQA. Furthermore, QCQA requires
$40\,$\% less KV-cache size than GQA to attain similar accuracy. The proposed
quality and capacity-aware grouping of query heads can serve as a new paradigm
for KV-cache optimization in autoregressive LLM inference.
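
The capacity side of this tradeoff can be illustrated with a back-of-the-envelope calculation. The sketch below is not the QCQA algorithm; it only counts KV-cache bytes as a function of the number of key/value heads, which equals the number of query-head groups. The names `kv_cache_bytes` and `num_kv_heads_for`, the 16-bit element size, the 4096-token context, and the example groupings are illustrative assumptions; the 32-layer, 32-head, 128-dimensional shape is the commonly reported Llama2 7B configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one batch of sequences."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def num_kv_heads_for(grouping):
    """A grouping is a list of query-head groups; each group shares one KV head."""
    return len(grouping)


# Assumed Llama2-7B-like shape: 32 layers, 32 query heads, head_dim 128.
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = [[h] for h in range(Q_HEADS)]                      # one KV head per query head
gqa = [list(range(g * 4, g * 4 + 4)) for g in range(8)]  # 8 equal groups of 4
mqa = [list(range(Q_HEADS))]                             # all query heads share one KV head
# Hypothetical unequal grouping, used here only to show that group sizes may differ.
uneven = [list(range(0, 10)), list(range(10, 12)), list(range(12, 32))]

sizes = {
    name: kv_cache_bytes(LAYERS, num_kv_heads_for(grouping), HEAD_DIM, SEQ_LEN)
    for name, grouping in [("MHA", mha), ("GQA-8", gqa),
                           ("MQA", mqa), ("uneven-3", uneven)]
}

for name, size in sizes.items():
    print(f"{name:>9}: {size / 2**20:8.1f} MiB")
```

Under these assumptions the cache size depends only on how many groups there are, not on which query heads share a group. The choice of group members affects only text generation quality, which is the degree of freedom a quality-aware grouping can exploit.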

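Proposes a quality- and capacity-aware method for grouping query heads to optimize the KV-cache in autoregressive LLM inference. The method reaches accuracy similar to that of other methods with a smaller KV-cache, and achieves higher accuracy than other methods after fine-tuning.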

QCQA: Quality and Capacity-Aware Grouped Query Attention

Large language models (LLMs) can solve challenging tasks. However, their
inference computation on modern GPUs is highly inefficient due to the
increasing number of tokens they must attend to as they generate new ones. To
address this inefficiency, we capitalize on LLMs' problem-solving capabilities
to optimize their own inference-time efficiency. We demonstrate this on two
specific tasks: (a) evaluating complex arithmetic expressions and (b)
summarizing news articles. For both tasks, we create custom datasets to
fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM
learn to solve the evaluation or summarization task, and second, to train it to
identify the minimal attention spans required for each step of the task. As a
result, the fine-tuned model is able to convert these self-identified minimal
attention spans into sparse attention masks on the fly during inference. We
develop a custom CUDA kernel that exploits the reduced attention context. We
demonstrate that this kernel improves the throughput
of LLM inference by 28%. Our work presents an end-to-end demonstration showing
that training LLMs to self-select their attention spans speeds up
autoregressive inference in solving real-world tasks.
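
The conversion from attention spans to a sparse attention mask can be sketched in a few lines. This is a minimal illustration in plain Python, not the custom CUDA kernel or the fine-tuned model described above. The span format (half-open key ranges per query position), the fallback to full causal attention, and the names `span_mask` and `attended_fraction` are all assumptions made for the example.

```python
def span_mask(seq_len, spans):
    """Build a boolean attention mask from per-token attention spans.

    spans maps a query position to a list of half-open (start, end) key
    ranges. Positions without an entry fall back to full causal attention.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        mask[i][i] = True  # every token attends to itself
        for start, end in spans.get(i, [(0, i + 1)]):
            # Clip each span so that no token attends to the future.
            for j in range(max(0, start), min(end, i + 1)):
                mask[i][j] = True
    return mask


def attended_fraction(mask):
    """Fraction of the dense causal mask that the sparse mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)


# Hypothetical spans for a 6-token sequence: tokens 3..5 need little context.
spans = {3: [(2, 4)], 4: [(0, 1)], 5: [(4, 6)]}
mask = span_mask(6, spans)
for row in mask:
    print("".join("x" if keep else "." for keep in row))
print(f"kept {attended_fraction(mask):.2f} of the causal mask")
```

A kernel can skip the masked-out key positions entirely, which is where a throughput gain would come from; the amount saved depends on how short the selected spans are.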

Training large language models to self-select their attention spans speeds up autoregressive inference on real-world tasks.

Self-Selected Attention Span for Accelerating Large Language Model Inference

Can a mere next-token predictor faithfully model human intelligence? We
crystallize this intuitive concern, which is fragmented in the literature. As a
starting point, we argue that the two often-conflated phases of next-token
prediction -- autoregressive inference and teacher-forced training -- must be
treated distinctly. The popular criticism that errors can compound during
autoregressive inference crucially assumes that teacher-forcing has learned an
accurate next-token predictor. This assumption sidesteps a more deep-rooted
problem we expose: in certain classes of tasks, teacher-forcing can simply fail
to learn an accurate next-token predictor in the first place. We describe a
general mechanism of how teacher-forcing can fail, and design a minimal
planning task where both the Transformer and the Mamba architecture empirically
fail in that manner -- remarkably, despite the task being straightforward to
learn. We provide preliminary evidence that this failure can be resolved when
training to predict multiple tokens in advance. We hope this finding can ground
future debates and inspire explorations beyond the next-token prediction
paradigm. We make our code available.
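
The distinction between the two phases can be made concrete with a toy example. The sketch below illustrates only the familiar compounding-error criticism, not the teacher-forcing failure mechanism or the planning task described above. The counting task, the deliberately flawed `toy_predictor`, and the simplification that it conditions on the previous token alone are all assumptions made for the example.

```python
def toy_predictor(prev_token):
    """Hypothetical next-token predictor for the counting task 0, 1, 2, ...

    It is correct everywhere except after token 3, where it skips ahead.
    """
    return 5 if prev_token == 3 else prev_token + 1


def teacher_forced(predict, sequence):
    """Predict each token from the ground-truth previous token."""
    return [predict(prev) for prev in sequence[:-1]]


def autoregressive(predict, first_token, num_steps):
    """Predict each token from the model's own previous output."""
    outputs, prev = [], first_token
    for _ in range(num_steps):
        prev = predict(prev)
        outputs.append(prev)
    return outputs


def accuracy(predictions, targets):
    """Fraction of positions where the prediction matches the target."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


sequence = list(range(8))  # ground truth: 0, 1, ..., 7
targets = sequence[1:]

tf_preds = teacher_forced(toy_predictor, sequence)
ar_preds = autoregressive(toy_predictor, sequence[0], len(targets))

print("teacher-forced:", tf_preds, accuracy(tf_preds, targets))
print("autoregressive:", ar_preds, accuracy(ar_preds, targets))
```

With ground-truth inputs the single mistake costs one position, whereas feeding the model its own outputs shifts every later prediction. The criticism examined above concerns this second setting and presupposes that the predictor is otherwise accurate.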

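By treating the two key phases of next-token prediction, autoregressive inference and teacher-forced training, separately, the study shows that the problem goes beyond errors compounding during autoregressive inference: in certain classes of tasks, teacher-forcing can fail to learn an accurate next-token predictor in the first place. The study demonstrates this failure empirically and offers preliminary evidence that training to predict multiple tokens in advance can resolve it. The authors hope the finding will prompt discussion and exploration beyond the next-token prediction paradigm.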

The pitfalls of next-token prediction

Regenerating natural language explanations in the scientific domain has been
proposed as a benchmark to evaluate complex multi-hop and explainable
inference. In this context, large language models can achieve state-of-the-art
performance when employed as cross-encoder architectures and fine-tuned on
human-annotated explanations. However, while much attention has been devoted to
the quality of the explanations, the problem of performing inference
efficiently is largely under-studied. Cross-encoders, in fact, are
intrinsically not scalable, which limits their applicability to real-world
scenarios that require inference on massive fact banks. To enable complex
multi-hop reasoning at scale, this paper focuses on bi-encoder architectures,
investigating the problem of scientific explanation regeneration at the
intersection of dense and sparse models. Specifically, we present SCAR (for
Scalable Autoregressive Inference), a hybrid framework that iteratively
combines a Transformer-based bi-encoder with a sparse model of explanatory
power, designed to leverage explicit inference patterns in the explanations.
Our experiments demonstrate that the hybrid framework significantly outperforms
previous sparse models, achieving performance comparable with that of
state-of-the-art cross-encoders while being approximately 50 times faster and
scalable to corpora of millions of facts. Further analyses of semantic drift and
multi-hop question answering reveal that the proposed hybridisation boosts the
quality of the most challenging explanations, contributing to improved
performance on downstream inference tasks.
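
The idea of iteratively combining a dense and a sparse score can be sketched as follows. This is a simplified illustration and not the SCAR implementation: toy two-dimensional vectors stand in for bi-encoder embeddings, Jaccard term overlap stands in for the sparse model of explanatory power, and the equal weighting, the query-update rule, and the name `hybrid_retrieve` are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def overlap(query_terms, fact_terms):
    """Jaccard overlap, used here as a stand-in sparse score."""
    union = query_terms | fact_terms
    return len(query_terms & fact_terms) / len(union) if union else 0.0


def hybrid_retrieve(query_terms, query_vec, facts, steps=2, weight=0.5):
    """Greedily pick one fact per step, then fold it into the query."""
    chosen, terms, vec = [], set(query_terms), list(query_vec)
    for _ in range(steps):
        best, best_score = None, float("-inf")
        for idx, (fact_terms, fact_vec) in enumerate(facts):
            if idx in chosen:
                continue
            score = (weight * cosine(vec, fact_vec)
                     + (1 - weight) * overlap(terms, fact_terms))
            if score > best_score:
                best, best_score = idx, score
        if best is None:
            break
        chosen.append(best)
        # Update the query so the next hop can build on the selected fact.
        terms |= facts[best][0]
        vec = [a + b for a, b in zip(vec, facts[best][1])]
    return chosen


# Hypothetical fact bank: (terms, embedding) pairs.
facts = [
    ({"friction", "heat"}, [1.0, 0.0]),
    ({"heat", "temperature"}, [0.0, 1.0]),
    ({"plants", "sunlight"}, [-1.0, 0.0]),
]
order = hybrid_retrieve({"rubbing", "friction"}, [1.0, 0.0], facts)
print(order)
```

In this toy fact bank the second fact shares no term with the original query and becomes reachable only after the first fact has been added, which is the multi-hop behaviour an iterative scheme is meant to capture.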

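Studies scientific explanation regeneration in natural language with bi-encoder models and proposes SCAR, a hybrid framework that combines a Transformer-based bi-encoder with a sparse model. SCAR enables complex multi-hop reasoning over large fact banks and improves performance on downstream inference tasks.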