在CPU上高效LLM推断

Nov, 2023

Efficient LLM Inference on CPUs

Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng

TL;DR本论文提出了一种有效的方法，可以更高效地部署大型语言模型，通过自动INT4纯权重量化流和设计具有高度优化内核的特殊LLM运行时，在CPU上加速LLM推理，展示了该方法对包括Llama2、Llama、GPT-NeoX等流行LLM的普适性，并显示了在CPU上的极高推理效率。

Abstract

large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which requires a demand for large memory capacity and hi