Scaling Large Language Models (LLMs) with extended context lengths has increased the need for efficient low-bit quantization to manage their substantial computational demands. However, reducing precision to 4 bits frequently degrades performance due to activation outliers. To address this, we propose Asymmetric Microscaling 4-bit Floating-Point (AMXFP4) for efficient LLM inference. This novel data format leverages asymmetric shared scales to mitigate outliers while naturally capturing the asymmetry introduced by group-wise quantization. Unlike conventional 4-bit quantization methods that rely on data rotation and costly calibration, AMXFP4 uses asymmetric shared scales for direct 4-bit casting, achieving near-ideal quantization accuracy across various LLM tasks, including multi-turn conversations, long-context reasoning, and visual question answering. Our AMXFP4 format significantly outperforms MXFP4 and other leading quantization techniques, enabling robust, calibration-free 4-bit inference.
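The contrast between a single shared scale and asymmetric shared scales can be illustrated with a small sketch. The abstract does not spell out how the scales are defined, so this example assumes one plausible reading: each group keeps one scale for its positive values and another for its negative values. The function names (`quantize_symmetric`, `quantize_asymmetric`, `mse`) are hypothetical, and the scales are kept as plain floats rather than in any encoded hardware format. The only fixed fact used is the set of magnitudes representable in FP4 E2M1.

```python
from typing import List

# Magnitudes representable by FP4 E2M1 (the sign bit is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def _round_to_grid(mag: float) -> float:
    """Return the FP4 magnitude nearest to a non-negative value."""
    return min(FP4_GRID, key=lambda g: abs(g - mag))


def quantize_symmetric(group: List[float]) -> List[float]:
    """Quantize a group with ONE shared scale set by the largest magnitude."""
    amax = max(abs(v) for v in group)
    if amax == 0.0:
        return [0.0] * len(group)
    scale = amax / FP4_MAX  # assumption: real-valued scale, not an encoded one
    out = []
    for v in group:
        sign = 1.0 if v >= 0 else -1.0
        out.append(sign * _round_to_grid(abs(v) / scale) * scale)
    return out


def quantize_asymmetric(group: List[float]) -> List[float]:
    """Quantize a group with SEPARATE shared scales for positives and negatives.

    Assumption: this is one reading of "asymmetric shared scales"; the exact
    definition used by AMXFP4 is not given in the abstract.
    """
    pos_max = max((v for v in group if v > 0), default=0.0)
    neg_max = max((-v for v in group if v < 0), default=0.0)
    pos_scale = pos_max / FP4_MAX
    neg_scale = neg_max / FP4_MAX
    out = []
    for v in group:
        if v > 0:
            out.append(_round_to_grid(v / pos_scale) * pos_scale)
        elif v < 0:
            out.append(-_round_to_grid(-v / neg_scale) * neg_scale)
        else:
            out.append(0.0)
    return out


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # A group with one large positive outlier and small values elsewhere.
    group = [12.0, 0.3, -0.4, -0.2, 0.1, -0.5, 0.2, -0.1]
    print("symmetric MSE :", mse(group, quantize_symmetric(group)))
    print("asymmetric MSE:", mse(group, quantize_asymmetric(group)))
```

In this toy group the positive outlier stretches the single shared scale so far that every small value rounds to zero, whereas the separate negative scale still resolves the small negative values. The small positive values remain coarse in both cases, which is why this is only an illustration of the idea and not a claim about the accuracy figures reported for AMXFP4.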

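This work addresses the performance degradation caused by low-precision quantization during inference with large language models that use extended context lengths. The proposed Asymmetric Microscaling 4-bit Floating-Point format (AMXFP4) uses asymmetric shared scales to reduce the impact of activation outliers, substantially improving 4-bit quantization accuracy. Across tasks including multi-turn conversations, long-context reasoning, and visual question answering, AMXFP4 outperforms conventional methods and supports robust, calibration-free inference.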

AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference