Speculative decoding has demonstrated its effectiveness in accelerating the
inference of large language models while maintaining a consistent sampling
distribution. However, the conventional approach of training a separate draft
model to achieve a satisfactory token acceptance rate can be costly. Drawing
inspiration from early exiting, we propose a novel self-speculative decoding
framework, \emph{Kangaroo}, which uses a fixed shallow sub-network as a
self-draft model, with the remaining layers serving as the larger target model.
We train a lightweight and efficient adapter module on top of the sub-network
to bridge the gap between the sub-network and the full model's representation
ability. It is noteworthy that the inference latency of the self-draft model
may no longer be negligible compared to the large model, necessitating
strategies to increase the token acceptance rate while minimizing the drafting
steps of the small model. To address this challenge, we introduce an additional
early exiting mechanism for generating draft tokens. Specifically, we halt the
small model's subsequent prediction during the drafting phase once the
confidence level for the current token falls below a certain threshold.
Extensive experiments on Spec-Bench demonstrate the effectiveness of
Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to
$1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7\% fewer additional
parameters (67M compared to 591M). The code for Kangaroo is available at
this https URL
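
The confidence-based halting of the drafting phase described above can be sketched as follows. This is a minimal illustration, not Kangaroo's implementation: `draft_step`, `threshold`, `max_draft_tokens`, and the scripted toy model are hypothetical names, and keeping the low-confidence token before stopping is an assumption about how "halting subsequent prediction" is realized.

```python
from typing import Callable, List, Tuple

# A draft step maps the current context to (next_token, top-1 confidence).
DraftStep = Callable[[List[int]], Tuple[int, float]]


def draft_with_early_exit(
    draft_step: DraftStep,
    prefix: List[int],
    threshold: float = 0.6,
    max_draft_tokens: int = 6,
) -> List[int]:
    """Draft tokens until confidence drops below `threshold` or the cap is hit."""
    context = list(prefix)
    drafted: List[int] = []
    for _ in range(max_draft_tokens):
        token, confidence = draft_step(context)
        drafted.append(token)  # assumption: the current token is still kept
        context.append(token)
        if confidence < threshold:
            break  # stop spending draft-model latency on unlikely tokens
    return drafted


def make_scripted_draft(script: List[Tuple[int, float]], prefix_len: int) -> DraftStep:
    """Toy stand-in for the shallow sub-network: replays fixed (token, confidence) pairs."""
    def step(context: List[int]) -> Tuple[int, float]:
        return script[len(context) - prefix_len]
    return step


script = [(11, 0.9), (12, 0.8), (13, 0.4), (14, 0.95)]
drafted = draft_with_early_exit(make_scripted_draft(script, 1), [1], threshold=0.6)
print(drafted)
```

In this toy run the third scripted token has confidence 0.4, so drafting stops there and the fourth token is never requested. The drafted tokens would then go to the full model for verification.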

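Kangaroo uses a shallow sub-network as a self-draft model and an early-stopping mechanism to raise the token acceptance rate, accelerating large language model inference; experiments on Spec-Bench demonstrate its effectiveness.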

Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

We present LayerSkip, an end-to-end solution to speed up inference of large
language models (LLMs). First, during training we apply layer dropout, with low
dropout rates for earlier layers and higher dropout rates for later layers, and
an early exit loss where all transformer layers share the same exit. Second,
during inference, we show that this training recipe increases the accuracy of
early exit at earlier layers, without adding any auxiliary layers or modules to
the model. Third, we present a novel self-speculative decoding solution where
we exit at early layers and verify and correct with remaining layers of the
model. Our proposed self-speculative decoding approach has a smaller memory
footprint than other speculative decoding approaches and benefits from shared
compute and activations of the draft and verification stages. We run
experiments on different Llama model sizes with different types of training:
pretraining from scratch, continual pretraining, finetuning on a specific data
domain, and finetuning on a specific task. We implement our inference solution
and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x
on coding, and 2.0x on the TOPv2 semantic parsing task.
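
The depth-dependent layer dropout described above (low rates for earlier layers, higher rates for later ones) can be sketched as follows. The abstract does not specify the schedule, so the linear ramp here is an assumption for illustration, and `layer_dropout_rates`, `sample_active_layers`, and `max_rate` are hypothetical names rather than LayerSkip's API.

```python
import random
from typing import List


def layer_dropout_rates(num_layers: int, max_rate: float) -> List[float]:
    """Per-layer dropout rates rising linearly from 0.0 to `max_rate`.

    The linear ramp is an assumed schedule, chosen only for illustration.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    if num_layers == 1:
        return [max_rate]
    return [max_rate * i / (num_layers - 1) for i in range(num_layers)]


def sample_active_layers(rates: List[float], rng: random.Random) -> List[int]:
    """Indices of layers kept for one training step.

    Layer i is skipped with probability rates[i].
    """
    return [i for i, rate in enumerate(rates) if rng.random() >= rate]


rates = layer_dropout_rates(num_layers=8, max_rate=0.5)
active = sample_active_layers(rates, random.Random(0))
print([round(r, 3) for r in rates])
print(active)
```

Because the first layer's rate is zero it is always executed, while deeper layers are skipped more often. That is the property that pushes the model to produce usable predictions from its earlier layers.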

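By training with layer dropout and an early exit loss, the paper speeds up inference of large language models and introduces a novel self-speculative decoding solution that reduces memory footprint and achieves speedups of up to 2.16x across different training tasks.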

Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding

We present a novel inference scheme, self-speculative decoding, for
accelerating Large Language Models (LLMs) without the need for an auxiliary
model. This approach is characterized by a two-stage process: drafting and
verification. The drafting stage generates draft tokens more quickly but at
slightly lower quality by selectively skipping certain intermediate layers.
Subsequently, the verification stage
employs the original LLM to validate those draft output tokens in one forward
pass. This process ensures the final output remains identical to that produced
by the unaltered LLM, thereby maintaining output quality. The proposed method
requires no additional neural network training and no extra memory footprint,
making it a plug-and-play and cost-effective solution for inference
acceleration. Benchmarks with LLaMA-2 and its fine-tuned models demonstrated a
speedup of up to $1.73\times$.
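
The draft-then-verify loop described above can be sketched for the greedy-decoding case as follows. This is a toy under stated assumptions, not the paper's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full LLM and its layer-skipping variant, and verification calls the target once per position here, whereas a real system scores all draft positions in one forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_greedy_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    num_new_tokens: int,
    draft_len: int = 4,
) -> List[int]:
    """Greedy decoding where a cheap drafter proposes and the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # Drafting stage: propose draft_len tokens with the cheap model.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # Verification stage: keep the target's token at every position and
        # stop at the first position where the draft disagrees.
        accepted: List[int] = []
        for guess in draft:
            correct = target_next(tokens + accepted)
            accepted.append(correct)
            if guess != correct:
                break
        else:
            # All drafts matched, so the target's next token comes for free.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


def plain_greedy(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Reference: ordinary token-by-token greedy decoding."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens


def target_next(ctx: List[int]) -> int:
    """Toy 'full model': a deterministic function of the context."""
    return (3 * ctx[-1] + len(ctx)) % 11


def draft_next(ctx: List[int]) -> int:
    """Toy 'layer-skipping model': agrees with the target except at some positions."""
    guess = target_next(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11


spec_out = speculative_greedy_decode(target_next, draft_next, [1, 2], 10)
ref_out = plain_greedy(target_next, [1, 2], 10)
print(spec_out == ref_out)
```

Every token appended to the output is one the target itself would have chosen, so the result matches plain greedy decoding no matter how often the drafter is wrong. Drafter quality affects only how many tokens are accepted per verification round.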

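We propose a novel inference scheme, self-speculative decoding, for accelerating large language models (LLMs) without an auxiliary model. The method works in two stages, drafting and verification. The drafting stage generates draft tokens faster but at slightly lower quality by selectively skipping certain intermediate layers. The verification stage then uses the original LLM to validate those draft tokens in a single forward pass. This ensures that the final output is identical to that of the unmodified LLM, preserving output quality. The method requires no additional neural network training and no extra memory footprint, making it a plug-and-play, cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its fine-tuned models show speedups of up to 1.73x.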