Large language models (LLMs) have demonstrated strong results on a range of
NLP tasks. Typically, outputs are obtained via autoregressive sampling from the
LLM's underlying distribution. We show that this inference strategy can be
suboptimal for a range of tasks and associated evaluation metrics. As a remedy,
we propose metric-aware LLM inference: a decision-theoretic approach that optimizes
for custom metrics at inference time. We report improvements over baselines on
academic benchmarks and publicly available models.

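Large language models (LLMs) achieve strong results across a range of NLP tasks, but the current inference strategy is suboptimal for many tasks and evaluation metrics. This work proposes metric-aware LLM inference, which uses decision theory to optimize for a specific metric at inference time, and reports improvements on academic benchmarks and publicly available models.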

Metric-aware LLM inference
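The abstract above does not spell out the procedure, so the following is only a minimal sketch of one common decision-theoretic recipe (minimum-Bayes-risk style selection): treat a handful of sampled outputs as Monte Carlo draws from the model and return the candidate with the highest average metric against them. The names `token_f1` and `metric_aware_select`, and the choice of a bag-of-words F1 metric, are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Hypothetical example metric: bag-of-words F1 between two strings."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return float(pred == ref)
    remaining = list(ref)
    overlap = 0
    for tok in pred:
        if tok in remaining:  # count each reference token at most once
            remaining.remove(tok)
            overlap += 1
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def metric_aware_select(
    samples: Sequence[str],
    metric: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the candidate with the highest average metric against all samples.

    The samples stand in for the model's output distribution, so the average
    is a Monte Carlo estimate of the expected metric for each candidate.
    """
    best, best_score = None, float("-inf")
    for cand in dict.fromkeys(samples):  # unique candidates, order preserved
        score = sum(metric(cand, s) for s in samples) / len(samples)
        if score > best_score:
            best, best_score = cand, score
    return best


# Assumed toy samples; in practice these would be drawn from the LLM.
samples = ["the cat sat", "the cat sat down", "a dog ran", "the cat sat"]
choice = metric_aware_select(samples)
```

Swapping `token_f1` for another metric changes which candidate is selected without touching the sampling step, which is the sense in which inference becomes metric-aware in this sketch.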

Inference optimizations are critical for improving user experience and
reducing infrastructure costs and power consumption. In this article, we
illustrate a form of dynamic execution known as speculative sampling to reduce
the overall latency of text generation and compare it with standard
autoregressive sampling. Speculative sampling can be combined with model-based
optimizations (e.g. quantization) to provide an optimized solution. Both
sampling methods make use of KV caching. A Jupyter notebook and some sample
executions are provided.

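Speculative sampling, a dynamic-execution form of inference optimization, can be combined with model optimization techniques such as quantization to provide an optimized solution. KV caching is used during execution.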

Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO
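To make the KV-caching idea mentioned in the abstract above concrete, here is a minimal pure-Python sketch of single-head attention during token-by-token generation. It is not the article's notebook and does not use OpenVINO; the dimensions, random weights, and helper names (`matvec`, `attention`) are assumptions chosen for illustration. The cached loop projects only the newest token into a key and value, while the reference loop re-projects the whole prefix at every step; both should produce the same outputs.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


random.seed(0)
d, steps = 4, 5  # assumed toy sizes
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
              for _ in range(3))
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(steps)]

# With a KV cache: each step projects only the newest token and appends it.
k_cache, v_cache, cached_out = [], [], []
for x in xs:
    k_cache.append(matvec(Wk, x))
    v_cache.append(matvec(Wv, x))
    cached_out.append(attention(matvec(Wq, x), k_cache, v_cache))

# Without a cache: every step re-projects the whole prefix from scratch.
full_out = []
for t in range(steps):
    keys = [matvec(Wk, x) for x in xs[: t + 1]]
    values = [matvec(Wv, x) for x in xs[: t + 1]]
    full_out.append(attention(matvec(Wq, xs[t]), keys, values))
```

The uncached loop performs a number of key/value projections that grows quadratically with sequence length, whereas the cached loop performs one per token, at the cost of keeping `k_cache` and `v_cache` in memory.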

Autoregressive sampling from large language models has led to
state-of-the-art results in several natural language tasks. However,
autoregressive sampling generates tokens one at a time, making it slow, and even
prohibitive in certain tasks. One way to speed up sampling is
$\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$
(block or sequence of tokens), and then score all tokens in the draft by the
large language model in parallel. A subset of the tokens in the draft are
accepted (and the rest rejected) based on a statistical method to guarantee
that the final output follows the distribution of the large model. In this
work, we provide a principled understanding of speculative decoding through the
lens of optimal transport (OT) with $\textit{membership cost}$. This framework
can be viewed as an extension of the well-known $\textit{maximal-coupling}$
problem. This new formulation enables us to generalize the speculative decoding
method to allow for a set of $k$ candidates at the token-level, which leads to
an improved optimal membership cost. We show that the optimal draft selection
algorithm (transport plan) can be computed via linear programming, whose
best-known runtime is exponential in $k$. We then propose a valid draft
selection algorithm whose acceptance probability is $(1-1/e)$-optimal
multiplicatively. Moreover, it can be computed in time almost linear in the size
of the domain of a single token. Using this new draft selection algorithm, we
develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which
provides speedup in decoding while ensuring that there is no quality
degradation in the decoded output. We experimentally demonstrate that for
state-of-the-art large language models, the proposed approach achieves a
wall-clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on
standard benchmarks.

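Building on autoregressive sampling and speculative decoding, this work gives a principled, optimal-transport-based formulation of speculative decoding and uses a new draft selection algorithm to speed up decoding without degrading output quality.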
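The accept/reject step that the abstract describes can be illustrated with the standard single-draft, token-level rule from speculative decoding: accept the drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(p - q, 0). This is a sketch of that baseline rule only, not of SpecTr's k-candidate draft selection; the function names and the toy distributions `p` (large model) and `q` (small model) are assumptions for illustration.

```python
import random


def speculative_step(p, q, rng=random):
    """Draw one token index using the single-draft accept/reject rule.

    p: target (large model) probabilities for the next token.
    q: draft (small model) probabilities for the next token.
    Returns (token_index, accepted_flag).
    """
    draft = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # Rejected: resample from the part of p that q under-covers.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q_i * min(1, p_i / q_i)
    reject_mass = 1.0 - sum(accept)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q, so the draft is never rejected
        return accept
    return [a + reject_mass * r / total for a, r in zip(accept, residual)]


# Assumed toy next-token distributions over a 3-token vocabulary.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
dist = output_distribution(p, q)
acceptance_probability = sum(min(pi, qi) for pi, qi in zip(p, q))
token, accepted = speculative_step(p, q, random.Random(0))
```

In this example `dist` matches `p` up to floating-point error, which is the guarantee the abstract refers to: the output follows the large model's distribution even though the small model proposes the token. The acceptance probability equals the sum of min(p, q) over tokens, the quantity that the maximal-coupling view seeks to maximize.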