We present Inferflow, an efficient and highly configurable inference engine for large language models (LLMs). With Inferflow, users can serve most common transformer models by editing a few lines in the corresponding configuration files, without writing any source code. Inferflow differs from most existing inference engines in three ways. First, it implements a modular framework of atomic building blocks and technologies, which makes it compositionally generalizable to new models. Second, it introduces 3.5-bit quantization as a tradeoff between 3-bit and 4-bit quantization. Third, it introduces hybrid model partitioning for multi-GPU inference, which balances inference speed and throughput better than the existing partition-by-layer and partition-by-tensor strategies.
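To make the "3.5-bit" figure concrete, the sketch below shows one way a fractional bit width can arise: two weights share a single 7-bit code, which averages 3.5 bits per weight. This is an illustrative assumption, not a description of Inferflow's actual storage format. The names `LEVELS`, `quantize_pair`, and `dequantize_pair` are hypothetical, and the choice of 11 levels per weight (11 × 11 = 121 joint codes, which fit in 2^7 = 128) is made only for the example.

```python
# Hypothetical sketch of a fractional-bit scheme; NOT Inferflow's actual format.
# Two weights share one 7-bit code, so the average cost is 7 / 2 = 3.5 bits.

LEVELS = 11          # assumed levels per weight: 11 * 11 = 121 <= 2**7 = 128
BITS_PER_PAIR = 7    # assumed size of the joint code


def _scale(lo: float, hi: float) -> float:
    """Step size between adjacent quantization levels on [lo, hi]."""
    return (hi - lo) / (LEVELS - 1) if hi > lo else 1.0


def quantize_pair(w0: float, w1: float, lo: float, hi: float) -> int:
    """Map two weights in [lo, hi] to one joint code in [0, LEVELS**2)."""
    s = _scale(lo, hi)
    # Round each weight to its nearest level index and clamp to [0, LEVELS - 1].
    q0 = min(LEVELS - 1, max(0, round((w0 - lo) / s)))
    q1 = min(LEVELS - 1, max(0, round((w1 - lo) / s)))
    return q0 * LEVELS + q1


def dequantize_pair(code: int, lo: float, hi: float) -> tuple[float, float]:
    """Recover the two approximate weights from a joint code."""
    s = _scale(lo, hi)
    q0, q1 = divmod(code, LEVELS)
    return lo + q0 * s, lo + q1 * s


if __name__ == "__main__":
    code = quantize_pair(0.33, -0.71, lo=-1.0, hi=1.0)
    print(code, dequantize_pair(code, lo=-1.0, hi=1.0))
    print("bits per weight:", BITS_PER_PAIR / 2)
```

Under these assumptions each weight gets 11 levels, which sits between the 8 levels of plain 3-bit quantization and the 16 levels of plain 4-bit quantization. That is the sense in which such a scheme trades off between the two.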

Inferflow: An Efficient and Highly Configurable Inference Engine for Large Language Models