This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical evaluation across models ranging from 10M to 1B parameters, we demonstrate that quantization can reduce model size by up to 68% while keeping performance within 6% of full-precision baselines when using our proposed scaling factor $\gamma$. Our experiments show that INT8 quantization delivers a 40% reduction in computational cost and power consumption, while INT4 quantization further improves these metrics by 60%. We introduce a novel theoretical framework for mixed-precision quantization, deriving optimal bit allocation strategies based on layer sensitivity and weight variance. Hardware efficiency evaluations on edge devices show that our quantization approach enables up to 2.4x throughput improvement for INT8 and 3x for INT4, with a 60% power reduction compared to full-precision models.
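
For readers unfamiliar with the basic operation behind PTQ, the following is a minimal sketch of symmetric per-tensor weight quantization. It is an illustration under stated assumptions, not the method evaluated in this paper: the function names `quantize_symmetric` and `dequantize` and the sample weights are hypothetical, and the max-abs `scale` computed here is a generic choice, not the proposed scaling factor $\gamma$.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map float weights to signed integers with one shared scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    # Generic max-abs scale; NOT the paper's scaling factor gamma.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integer codes."""
    return [v * scale for v in q]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.4, -1.0, 0.2, 0.0]
q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)

# Worst-case reconstruction error at each bit width.
err8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, s4)))
print(q8, err8)
print(q4, err4)
```

In this sketch the rounding error of each weight is bounded by half the scale, so moving from INT8 to INT4 coarsens the grid and raises the worst-case error. This is the kind of trade-off a per-layer bit allocation has to weigh.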

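This study addresses the optimization of large language models by evaluating two quantization techniques: post-training quantization (PTQ) and quantization-aware training (QAT). It proposes a new theoretical framework that derives optimal bit allocation strategies from layer sensitivity and weight variance, and experiments show that the method preserves performance while substantially reducing model size and computational cost. The most notable finding is that the quantization approach yields large throughput gains and power reductions on edge devices.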

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques