The recent rise of large language models (LLMs) has spurred efforts to run them at reduced precision. Lower precision accommodates resource constraints and furthers the democratization of LLMs, enabling users to run billion-parameter models on personal devices. To supplement this ongoing effort, we propose INT-FP-QSim: an open-source simulator that enables flexible evaluation of LLMs and vision transformers at various numerical precisions and formats. INT-FP-QSim combines existing open-source repositories such as TensorRT, QPytorch and AIMET into a single simulator that supports various floating point and integer formats. Using our simulator, we survey the impact of different numerical formats on the performance of LLMs and vision transformers at 4-bit weights and 4-bit or 8-bit activations. We also compare the effect of recently proposed methods such as Adaptive Block Floating Point, SmoothQuant, GPTQ and RPTQ on model performance. We hope INT-FP-QSim will enable researchers to flexibly simulate models at various precisions and so support further research on quantization of LLMs and vision transformers.
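
To make concrete what simulating reduced precision means, the following is a minimal sketch of integer quantize-dequantize ("fake quantization") in plain Python. It is not INT-FP-QSim's API: the function name `fake_quantize`, the symmetric per-tensor scaling, and the sample weights are all assumptions chosen for illustration.

```python
def fake_quantize(values, num_bits=4):
    """Simulate symmetric per-tensor integer quantization (hypothetical helper).

    Values are mapped to a signed integer grid and immediately mapped back
    to floats, so later computation sees only the rounding error.
    """
    qmax = 2 ** (num_bits - 1) - 1           # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)                  # nothing to scale
    scale = max_abs / qmax                   # one scale for the whole tensor
    out = []
    for v in values:
        q = round(v / scale)                 # nearest integer level
        q = max(-qmax, min(qmax, q))         # clamp to the representable range
        out.append(q * scale)                # dequantize back to float
    return out


weights = [0.7, -0.35, 0.1, 0.02]            # illustrative values only
w4 = fake_quantize(weights, num_bits=4)
w8 = fake_quantize(weights, num_bits=8)

# Worst-case rounding error at each precision.
err4 = max(abs(a - b) for a, b in zip(weights, w4))
err8 = max(abs(a - b) for a, b in zip(weights, w8))
print(f"max error at 4 bits: {err4:.4f}, at 8 bits: {err8:.4f}")
```

The 8-bit grid is finer, so its worst-case rounding error is smaller than the 4-bit one. A simulator in this style runs the model in ordinary floating point while injecting such rounding at weights and activations.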

INT-FP-QSim: Mixed Precision and Formats for Large Language Models and Vision Transformers