The high power consumption and latency-sensitive deployments of large
language models (LLMs) have motivated techniques like quantization and
sparsity. Contextual sparsity, where the sparsity pattern is input-dependent,
is crucial in LLMs because the permanent removal of attention heads or neurons
from LLMs can significantly degrade accuracy. Prior work has attempted to model
contextual sparsity using neural networks trained to predict activation
magnitudes, which can be used to dynamically prune structures with low
predicted activation magnitude. In this paper, we look beyond magnitude-based
pruning criteria to assess attention head and neuron importance in LLMs. We
develop a novel predictor called ShadowLLM, which shadows the LLM's behavior
and enforces better sparsity patterns, resulting in over 15% improvement in
end-to-end accuracy without increasing latency compared to previous methods.
ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu
framework. These enhancements are validated on models with up to 30 billion
parameters. Our code is available at
this https URL (ShadowLLM).
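
The predictor-driven pruning described above can be illustrated with a minimal sketch. It is not the ShadowLLM or DejaVu implementation: the names `topk_mask` and `masked_mlp`, the toy dimensions, and the use of a ReLU MLP are assumptions made for illustration, and the true activation magnitude stands in for a learned predictor's output. Any other per-neuron importance score could be passed to `topk_mask` in its place.

```python
import numpy as np

def topk_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Boolean mask that keeps the k highest-scoring neurons."""
    n = scores.shape[0]
    keep = np.argsort(scores)[n - k:]  # indices of the k largest scores
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    return mask

def masked_mlp(x: np.ndarray, W1: np.ndarray, W2: np.ndarray,
               mask: np.ndarray) -> np.ndarray:
    """ReLU MLP evaluated using only the neurons selected by mask."""
    hidden = np.maximum(W1[mask] @ x, 0.0)
    return W2[:, mask] @ hidden

# Toy weights and input (hypothetical sizes, random values).
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W1 = rng.standard_normal((d_hidden, d_model))
W2 = rng.standard_normal((d_model, d_hidden))
x = rng.standard_normal(d_model)

# Stand-in for a learned predictor: the true activation magnitude.
scores = np.abs(np.maximum(W1 @ x, 0.0))

mask = topk_mask(scores, k=16)  # keep half of the neurons for this input
y_sparse = masked_mlp(x, W1, W2, mask)
y_dense = masked_mlp(x, W1, W2, np.ones(d_hidden, dtype=bool))
```

Because the mask is recomputed from the scores for every input, the sparsity pattern is input-dependent rather than a permanent removal of neurons.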

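The ShadowLLM predictor enforces better sparsity patterns, improving end-to-end accuracy by over 15% and achieving up to a 20% speed-up over DejaVu, validated on models with up to 30 billion parameters.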

ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Large language models (LLMs) with hundreds of billions of parameters have
sparked a new wave of exciting AI applications. However, they are
computationally expensive at inference time. Sparsity is a natural approach to
reduce this cost, but existing methods either require costly retraining, have
to forgo the LLM's in-context learning ability, or do not yield wall-clock time
speedup on modern hardware. We hypothesize that contextual sparsity, that is,
small, input-dependent sets of attention heads and MLP parameters that yield
approximately the same output as the dense model for a given input, can address
these issues. We show that contextual sparsity exists, that it can be
accurately predicted, and that we can exploit it to speed up LLM inference in
wall-clock time without compromising the LLM's quality or in-context learning
ability. Based on these insights, we propose DejaVu, a system that uses a
low-cost algorithm to predict contextual sparsity on the fly given inputs to
each layer, along with an asynchronous and hardware-aware implementation that
speeds up LLM inference. We validate that DejaVu can reduce the inference
latency of OPT-175B by over 2X compared to the state-of-the-art
FasterTransformer, and over 6X compared to the widely used Hugging Face
implementation, without compromising model quality. The code is available at
this https URL
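
The notion of contextual sparsity defined above can be made concrete with a small sketch. It is not the DejaVu system: it assumes a single ReLU MLP block with random toy weights, and the helper names `dense_mlp`, `active_set`, and `sparse_mlp` are hypothetical. With a ReLU activation, neurons whose pre-activation is non-positive contribute nothing to the output, so restricting the computation to the active set for a given input reproduces the dense output, and that set changes from input to input.

```python
import numpy as np

# Toy weights (hypothetical sizes, random values).
rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64
W1 = rng.standard_normal((d_hidden, d_model))
W2 = rng.standard_normal((d_model, d_hidden))

def dense_mlp(x: np.ndarray) -> np.ndarray:
    """Full ReLU MLP using every hidden neuron."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def active_set(x: np.ndarray) -> np.ndarray:
    """Indices of neurons whose ReLU output is nonzero for this input."""
    return np.flatnonzero(W1 @ x > 0.0)

def sparse_mlp(x: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """ReLU MLP restricted to the hidden neurons listed in idx."""
    return W2[:, idx] @ np.maximum(W1[idx] @ x, 0.0)

x_a = rng.standard_normal(d_model)
x_b = rng.standard_normal(d_model)
idx_a, idx_b = active_set(x_a), active_set(x_b)

y_dense = dense_mlp(x_a)
y_sparse = sparse_mlp(x_a, idx_a)  # uses only the neurons active for x_a
```

In this sketch the active set is computed from the dense pre-activations, which saves nothing by itself. The abstract's point is that such sets can be predicted cheaply before the layer is evaluated.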

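DejaVu combines a contextual-sparsity prediction algorithm with an asynchronous, hardware-aware implementation, reducing the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer and by over 6X compared to the widely used Hugging Face implementation, without compromising model quality.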