Speculative Decoding (SD) has become an important technique for accelerating the inference of large language models. Conventional SD methods employ a fixed draft length, which ignores how token generation difficulty varies across tasks. In this paper, we address this issue and introduce SVIP, a difficulty-aware dynamic draft length policy for speculative decoding systems. Building on a theoretical lower bound of the draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the length of each draft sequence from the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, which achieves up to 20\% walltime speedup on SpecBench over baseline SD methods and 60\% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is entirely training-free and compatible with any existing SD method that generates draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvements on top of GliDe & CaPE and EAGLE-2.
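
The entropy-based stopping idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the plain entropy threshold, its default value, and the maximum draft length are all assumptions made for illustration, and the exact criterion SVIP derives from the acceptance-rate lower bound is not reproduced here. The sketch only shows the general pattern of drafting while the draft distribution is confident and stopping early when it is not.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_until_uncertain(
    next_token_probs: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    entropy_threshold: float = 1.0,  # hypothetical value, chosen for illustration
    max_draft_len: int = 8,          # hypothetical cap on the draft length
) -> List[int]:
    """Greedily draft tokens until the draft distribution becomes too uncertain.

    `next_token_probs` stands in for the draft model: given the token ids so
    far, it returns a probability distribution over the vocabulary.
    """
    context = list(prefix)
    draft: List[int] = []
    for _ in range(max_draft_len):
        probs = next_token_probs(context)
        # High entropy suggests a "hard" token that is likely to be rejected,
        # so stop drafting and hand over to the target model for verification.
        if entropy(probs) > entropy_threshold:
            break
        token = max(range(len(probs)), key=probs.__getitem__)
        draft.append(token)
        context.append(token)
    return draft
```

Because the stopping rule only inspects the draft model's own output distribution, a policy of this shape needs no extra training and can wrap any drafter that proposes tokens one at a time, which is consistent with the compatibility claim in the abstract.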

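This work addresses the fixed draft length used by conventional speculative decoding (SD) methods and proposes SVIP, a new difficulty-aware dynamic draft length policy. SVIP adaptively adjusts the draft sequence length based on the entropy of the draft token distribution. Experimental results show that it achieves up to 20\% walltime speedup over baseline methods on mainstream SD benchmarks, demonstrating notable acceleration and compatibility.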

Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding