This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key idea underlying the design of PowerInfer is exploiting
the high locality inherent in LLM inference, characterized by a power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, are consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
PowerInfer exploits this insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access, while
cold-activated neurons are computed on the CPU, thus significantly reducing GPU
memory demands and CPU-GPU data transfers. PowerInfer further integrates
adaptive predictors and neuron-aware sparse operators, optimizing the
efficiency of neuron activation and computational sparsity. Evaluation shows
that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x
while retaining model accuracy.

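PowerInfer is a high-speed GPU-CPU hybrid inference engine that exploits the
high locality inherent in Large Language Model (LLM) inference. It preloads
hot-activated neurons onto the GPU for fast access and computes cold-activated
neurons on the CPU, significantly reducing GPU memory demands and CPU-GPU data
transfers, and it further optimizes neuron activation and computational
sparsity through adaptive predictors and neuron-aware sparse operators.
Evaluation shows that on a single NVIDIA RTX 4090 GPU, PowerInfer attains an
average generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s,
across various LLMs (including OPT-175B), only 18% lower than a top-tier
server-grade A100 GPU and up to 11.69x faster than llama.cpp, while retaining
model accuracy.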
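
The hot/cold split described above can be illustrated with a minimal sketch.
This is not PowerInfer's implementation: the function names (split_hot_cold,
coverage, sparse_ffn), the fixed GPU neuron budget, and the ReLU feed-forward
layer are all assumptions made for illustration, and both partitions run in
plain Python here rather than on a real GPU and CPU. The sketch ranks neurons
by how often they were activated in a profiling run, assigns the most
frequently activated ones to the "hot" (GPU) partition, and computes only the
neurons that a predictor marks as active.

```python
def split_hot_cold(activation_counts, gpu_budget):
    """Split neuron indices into hot (most frequently activated) and cold.

    activation_counts[i] is how often neuron i fired during profiling
    (hypothetical input); gpu_budget is how many neurons fit on the GPU.
    """
    order = sorted(range(len(activation_counts)),
                   key=lambda i: activation_counts[i], reverse=True)
    hot = sorted(order[:gpu_budget])
    cold = sorted(order[gpu_budget:])
    return hot, cold


def coverage(activation_counts, hot):
    """Fraction of all profiled activations that fall on hot neurons."""
    total = sum(activation_counts)
    return sum(activation_counts[i] for i in hot) / total if total else 0.0


def sparse_ffn(x, weights, hot, cold, predicted_active):
    """ReLU layer that skips neurons the predictor expects to be inactive.

    weights[i] is the weight row of neuron i. The hot and cold groups stand
    in for the GPU and CPU partitions; skipped neurons output 0.
    """
    out = [0.0] * len(weights)
    for group in (hot, cold):
        for i in group:
            if i in predicted_active:
                dot = sum(w * v for w, v in zip(weights[i], x))
                out[i] = max(0.0, dot)
    return out


if __name__ == "__main__":
    # Synthetic power-law profile: the neuron at rank r fires ~1/(r+1) as often.
    counts = [1.0 / (rank + 1) for rank in range(1000)]
    hot, cold = split_hot_cold(counts, gpu_budget=100)
    print(f"hot neurons: {len(hot)}, cold neurons: {len(cold)}")
    print(f"share of activations covered by hot neurons: {coverage(counts, hot):.2f}")
```

Under a power-law profile such as the synthetic one above, a small hot
partition covers a disproportionate share of activations, which is the
property the abstract relies on to keep GPU memory demands low.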