We propose a novel fine-grained quantization method for ternarizing pre-trained full-precision models, while also constraining activations to 8 bits. Using this method, we demonstrate minimal loss in classification accuracy on state-of-the-art topologies without additional training. This enables a full 8-bit inference pipeline, with the best reported accuracy using ternary weights on the ImageNet dataset. We also provide an improved theoretical formulation that forms the basis for a higher-quality solution with this approach. Our method ternarizes the original weight tensor in groups of $N$ weights. Using $N=4$, we achieve Top-1 accuracy within $3.7\%$ and $5.8\%$ of the baseline full-precision result for Resnet-101 and Resnet-50 respectively, while eliminating $75\%$ of all multiplications. We also study the impact of group size on both performance and accuracy. With a group size of $N=64$, we eliminate $\approx99\%$ of the multiplications; however, this introduces a significant drop in accuracy, which necessitates fine-tuning the parameters (re-training) at lower precision. To address this, we re-train Resnet-50 with 8-bit activations and ternary weights, improving the Top-1 accuracy to within $4\%$ of the full-precision result with $<30\%$ additional overhead. Our final quantized model can run on a full 8-bit compute pipeline using 2-bit weights and has the potential of up to $16\times$ improvement in performance compared to baseline full-precision models.
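
To make the idea of ternarizing weights in groups of $N$ concrete, the following is a minimal Python sketch, not the formulation used in the paper. It assumes a simple heuristic borrowed from earlier ternary-weight work: each group's threshold is `threshold_factor` times its mean absolute weight, and the group's scale is the mean magnitude of the weights above that threshold. The function names (`ternarize_groups`, `dequantize`, `multiplications_eliminated`) and the default `threshold_factor=0.7` are illustrative assumptions. The last helper assumes one scale multiplication remains per group of $N$ weights, which is consistent with the $75\%$ figure quoted for $N=4$.

```python
def ternarize_groups(weights, group_size=4, threshold_factor=0.7):
    """Ternarize a flat list of weights in groups of `group_size`.

    Returns (codes, scales): codes[g] holds values in {-1, 0, +1} for group g,
    and scales[g] is the single floating-point scale shared by that group.
    The threshold rule below is an assumed heuristic, not the paper's method.
    """
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("weight count must be a positive multiple of group_size")

    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        magnitudes = [abs(w) for w in group]

        # Assumed threshold: a fixed fraction of the group's mean magnitude.
        delta = threshold_factor * sum(magnitudes) / group_size

        # Weights at or below the threshold become 0; the rest keep their sign.
        codes.append([(1 if w > 0 else -1) if abs(w) > delta else 0
                      for w in group])

        # Shared scale: mean magnitude of the weights that survived.
        kept = [m for m in magnitudes if m > delta]
        scales.append(sum(kept) / len(kept) if kept else 0.0)
    return codes, scales


def dequantize(codes, scales):
    """Rebuild approximate weights: one multiplication by the scale per code."""
    return [c * s for group, s in zip(codes, scales) for c in group]


def multiplications_eliminated(group_size):
    """Fraction of multiplications removed if one scale multiply remains
    per group of `group_size` weights (an assumption of this sketch)."""
    return 1.0 - 1.0 / group_size
```

Under the one-multiply-per-group assumption, `multiplications_eliminated(4)` evaluates to `0.75` and `multiplications_eliminated(64)` to `0.984375`, in line with the $75\%$ and $\approx99\%$ figures above. Smaller groups track the original weights more closely because each scale covers fewer values, which is the accuracy-versus-performance trade-off the abstract describes.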

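This paper proposes a fine-grained quantization (FGQ) method that ternarizes pre-trained full-precision models while constraining activations to 8 and 4 bits. Using this method, we show that minimal loss in classification accuracy can be achieved on state-of-the-art topologies without additional training. The method applies to models such as Resnet-101 and Resnet-50 and eliminates 75% of the multiplications, enabling a full 8/4-bit inference pipeline with the best reported accuracy on the ImageNet dataset and a potential performance improvement of up to 9×. The final quantized model can improve performance by up to 15× relative to the full-precision model.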

Ternary Neural Networks with Fine-Grained Quantization