This paper introduces PowerInfer-2, a framework for high-speed inference of Large Language Models (LLMs) on smartphones that is particularly effective for models larger than the device's memory capacity. The key insight of PowerInfer-2 is to exploit the heterogeneous computation, memory, and I/O resources of a smartphone by decomposing traditional matrix computations into fine-grained neuron-cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts its computational strategy to the different stages of LLM inference. It also introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which minimize and hide the overhead of I/O operations. The implementation and evaluation of PowerInfer-2 show that it supports a wide range of LLMs on two smartphones, achieving up to a 29.2x speedup over state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model on a smartphone, at a generation rate of 11.68 tokens per second. For models that fit entirely in memory, PowerInfer-2 reduces memory usage by approximately 40% while maintaining inference speeds comparable to llama.cpp and MLC-LLM. For more details, including a demonstration video, please visit the project site at www.powerinfer.ai/v2.

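PowerInfer-2 is a framework for high-speed inference of large language models (LLMs) on smartphones. By decomposing traditional matrix computations into fine-grained neuron-cluster computations, it exploits the heterogeneous computation, memory, and I/O resources of a smartphone, applies different computational strategies, and reduces the overhead of I/O operations. Implementation and evaluation on two smartphones show that PowerInfer-2 achieves up to a 29.2x speedup over existing frameworks and is the first system to serve the TurboSparse-Mixtral-47B model on a smartphone, at a generation rate of 11.68 tokens per second. For models that fit entirely in memory, PowerInfer-2 reduces memory usage by approximately 40% while maintaining inference speed comparable to llama.cpp and MLC-LLM.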
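The idea of replacing one large matrix computation with per-cluster computations, and of caching clusters to avoid repeated I/O, can be illustrated with a toy example. The sketch below is not PowerInfer-2's implementation: every name in it (`make_clusters`, `ClusterCache`, `sparse_matvec`) is hypothetical, the "flash storage" is just a Python dict, the set of active clusters is assumed to be given, and a plain LRU policy stands in for the segmented neuron cache described in the abstract.

```python
from collections import OrderedDict


def dense_matvec(weights, x):
    """Reference computation: one dot product per neuron (matrix row)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_clusters(weights, cluster_size):
    """Group consecutive rows (neurons) into clusters keyed by cluster id."""
    return {
        start // cluster_size: weights[start:start + cluster_size]
        for start in range(0, len(weights), cluster_size)
    }


class ClusterCache:
    """Toy LRU cache of neuron clusters (an assumption, not the paper's policy)."""

    def __init__(self, storage, capacity):
        self.storage = storage        # cluster id -> rows; stands in for flash
        self.capacity = capacity      # max clusters held in memory (>= 1)
        self.cache = OrderedDict()
        self.loads = 0                # number of simulated I/O reads

    def get(self, cluster_id):
        if cluster_id in self.cache:
            self.cache.move_to_end(cluster_id)   # mark as most recently used
        else:
            self.loads += 1                      # cache miss: simulated read
            self.cache[cluster_id] = self.storage[cluster_id]
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
        return self.cache[cluster_id]


def sparse_matvec(cache, n_neurons, cluster_size, active_clusters, x):
    """Compute outputs only for neurons in active clusters; others stay 0."""
    out = [0.0] * n_neurons
    for cid in active_clusters:
        rows = cache.get(cid)
        base = cid * cluster_size
        for offset, row in enumerate(rows):
            out[base + offset] = sum(w * v for w, v in zip(row, x))
    return out


weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]
x = [1.0, 2.0]
cache = ClusterCache(make_clusters(weights, cluster_size=2), capacity=1)
print(dense_matvec(weights, x))                # [1.0, 2.0, 6.0, 1.0]
print(sparse_matvec(cache, 4, 2, [1], x))      # [0.0, 0.0, 6.0, 1.0]
```

In this toy setting, only the rows of the active cluster are read and multiplied, and a second call with the same active cluster triggers no further simulated read. How active clusters are chosen, and how computation is overlapped with I/O, is outside the scope of this sketch.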

PowerInfer-2: Fast Large Language Model Inference on a Smartphone