Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp

Jun, 2024

Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng

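TL;DR: The inference performance of the Qwen-1.8B model is optimized by applying Int8 quantization, vectorizing some operators in llama.cpp, and modifying the compilation script to raise the compiler optimization level. On the Yitian 710 experimental platform, prefill performance improves by 1.6 times, decoding performance improves by 24 times, memory usage is reduced to 1/5 of the original, and the accuracy loss is almost negligible.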
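To make the Int8 quantization step mentioned above concrete, here is a minimal sketch of symmetric per-block quantization. It is an illustration under stated assumptions, not the authors' implementation: the block size of 32 and the `amax / 127` scale mirror the general shape of llama.cpp's Q8_0 format, while the function names `quantize_int8` and `dequantize_int8` are hypothetical and the summary does not say which quantization variant the paper actually used.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumption: 32 values share one scale, as in a Q8_0-style layout


def quantize_int8(values: List[float],
                  block_size: int = BLOCK_SIZE) -> Tuple[List[int], List[float]]:
    """Symmetric per-block Int8 quantization (hypothetical helper).

    Each block stores one float scale plus one signed 8-bit integer per value.
    """
    if len(values) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    quants: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)       # largest magnitude in the block
        scale = amax / 127.0                    # maps [-amax, amax] onto [-127, 127]
        inv = 1.0 / scale if scale != 0.0 else 0.0  # all-zero block: keep zeros
        scales.append(scale)
        # Round to the nearest integer and clamp to the signed 8-bit range used here.
        quants.extend(max(-127, min(127, round(v * inv))) for v in block)
    return quants, scales


def dequantize_int8(quants: List[int], scales: List[float],
                    block_size: int = BLOCK_SIZE) -> List[float]:
    """Recover approximate floats by multiplying each integer by its block scale."""
    return [q * scales[i // block_size] for i, q in enumerate(quants)]
```

Because rounding moves each value by at most half a quantization step, the reconstruction error per element is bounded by half of that block's scale. This is the sense in which an Int8 scheme can trade a small, bounded accuracy loss for a smaller memory footprint.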

Abstract

This article optimizes the inference performance of the Qwen-1.8B model by performing Int8 quantization, vectorizing some operators in llama.cpp