Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation of large language models (LLMs), using a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while keeping the final output distribution equivalent. Experiments across multiple datasets demonstrate an average speedup of 1.12$\times$ over the state-of-the-art speculative sampling method EAGLE-2.
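
The idea of restricting the draft search to a frequency-prioritized token subset can be illustrated with a minimal sketch. This is not the FR-Spec implementation: the function names, the toy corpus, the toy LM head weights, and the choice of keeping one quarter of the vocabulary are all assumptions made for illustration. The verification step, in which the target model scores drafts over the full vocabulary, is omitted.

```python
from collections import Counter


def frequency_ranked_subset(token_ids, keep_fraction, vocab_size):
    """Return the ids of the most frequent tokens, most frequent first.

    Hypothetical helper: token_ids stands in for a tokenized corpus.
    """
    counts = Counter(token_ids)
    keep = max(1, int(vocab_size * keep_fraction))
    # Rank every vocabulary id by corpus frequency; ties are broken by id.
    ranked = sorted(range(vocab_size), key=lambda t: (-counts[t], t))
    return ranked[:keep]


def draft_logits(hidden, lm_head, subset):
    """Compute logits only for the LM head rows listed in subset."""
    return [sum(h * w for h, w in zip(hidden, lm_head[t])) for t in subset]


def draft_top1(hidden, lm_head, subset):
    """Pick the best draft token within the subset, as a full-vocabulary id."""
    logits = draft_logits(hidden, lm_head, subset)
    best = max(range(len(subset)), key=lambda i: logits[i])
    return subset[best]


# Toy setup (assumed values): 8-token vocabulary, 2-dimensional hidden state.
VOCAB_SIZE = 8
corpus = [0, 0, 0, 1, 1, 3, 3, 3, 3, 5]
lm_head = [[t * 0.1, -t * 0.05] for t in range(VOCAB_SIZE)]
hidden = [1.0, 0.5]

subset = frequency_ranked_subset(corpus, keep_fraction=0.25, vocab_size=VOCAB_SIZE)
rows_saved = 1 - len(subset) / VOCAB_SIZE

print(subset)                               # ids kept for drafting
print(draft_top1(hidden, lm_head, subset))  # draft token, full-vocabulary id
print(rows_saved)                           # fraction of LM head rows skipped
```

In this toy setup the draft model multiplies the hidden state against 2 of the 8 LM head rows, so three quarters of the rows are skipped. The draft token is necessarily drawn from the subset, even if a token outside it would score higher; in speculative sampling it is the target model's verification over the full vocabulary that keeps the final output distribution unchanged.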

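This work addresses the substantial loss of efficiency that existing speculative sampling methods suffer on large-vocabulary language models such as Llama-3-8B. The proposed FR-Spec framework compresses the vocabulary space and optimizes draft candidate selection, reducing LM head computation overhead by 75% while keeping the final output distribution equivalent. Experimental results show that the method achieves an average speedup of 1.12$\times$ over the state-of-the-art EAGLE-2 method across multiple datasets.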

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling