One of the most striking findings in modern research on large language models
(LLMs) is that scaling up compute during training leads to better results.
However, less attention has been given to the benefits of scaling compute
during inference. This survey focuses on these inference-time approaches. We
explore three areas under a unified mathematical formalism: token-level
generation algorithms, meta-generation algorithms, and efficient generation.
Token-level generation algorithms, often called decoding algorithms, operate by
sampling a single token at a time or constructing a token-level search space
and then selecting an output. These methods typically assume access to a
language model's logits, next-token distributions, or probability scores.
Meta-generation algorithms work on partial or full sequences, incorporating
domain knowledge, enabling backtracking, and integrating external information.
Efficient generation methods aim to reduce token costs and improve the speed of
generation. Our survey unifies perspectives from three research communities:
traditional natural language processing, modern LLMs, and machine learning
systems.

通过对大型语言模型的研究，发现在训练过程中提高计算能力可以取得更好的结果，然而对于推断阶段提高计算能力的好处却没有得到足够的关注。本文调查了推断阶段的几种方法，包括基于令牌级别的生成算法、元生成算法和高效生成方法，并从传统自然语言处理、现代大型语言模型和机器学习系统的角度统一了观点。

从解码到元生成：大型语言模型的推理时间算法

From Decoding to Meta-Generation: Inference-time Algorithms for Large  Language Models

We investigate large language model performance across five orders of
magnitude of compute scaling in eleven recent model architectures. We show that
average benchmark performance, aggregating over many individual tasks and
evaluations as in the commonly-used BIG-Bench dataset, is decently predictable
as a function of training compute scale. Specifically, when extrapolating
BIG-Bench Hard performance across one order of magnitude in compute, we observe
average absolute errors of 6 percentage points (pp). By contrast, extrapolation
for individual BIG-Bench tasks across an order of magnitude in compute yields
higher average errors of 18pp. Nonetheless, individual task performance remains
significantly more predictable than chance. Overall, our work suggests compute
scaling provides a promising basis to forecast AI capabilities in diverse
benchmarks, though predicting performance in specific tasks poses challenges.

通过在 11 种最近的模型架构中研究大规模语言模型在五个数量级的计算规模上的表现，我们发现平均基准性能相当可预测，尽管在特定任务中的性能预测具有挑战性，因此计算规模提供了预测人工智能在不同基准测试中能力的有希望的基础。

语言模型基准测试的可预测性如何？

How predictable is language model benchmark performance?

Previous research observed accuracy degradation when replacing the attention
softmax with a point-wise activation such as ReLU. In the context of vision
transformers, we find that this degradation is mitigated when dividing by
sequence length. Our experiments training small to large vision transformers on
ImageNet-21k indicate that ReLU-attention can approach or match the performance
of softmax-attention in terms of scaling behavior as a function of compute.

通过在视觉变换器上进行实验，我们发现当将注意力 softmax 替换为 ReLU 等点层激活时，通过将结果除以序列长度可以减轻准确性下降现象。我们在 ImageNet-21k 上对各种规模的视觉变换器进行训练的实验表明，对于计算扩展性而言，ReLU-attention 的性能可以接近或匹配 softmax-attention。