The majority of the research on the quantization of Deep Neural Networks
(DNNs) is focused on reducing the precision of tensors visible to high-level
frameworks (e.g., weights, activations, and gradients). However, current
hardware still relies on high-accuracy core operations. Most significant is the
operation of accumulating products. This high-precision accumulation operation
is gradually becoming the main computational bottleneck because, so far, using
low-precision accumulators has led to a significant degradation in model
accuracy. In this work, we present a simple method to train and fine-tune
high-end DNNs that allows, for the first time, the use of cheaper $12$-bit
accumulators with no significant degradation in accuracy. Lastly, we show that
as we decrease the accumulation precision further, using fine-grained gradient
approximations can improve the DNN accuracy.
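
To illustrate why a narrow accumulator can hurt accuracy, the following minimal sketch sums integer products in a signed register of a chosen width. It is not the method from the abstract: the helper names, the example values, and the choice of two's-complement wraparound (rather than saturation, which some hardware uses) are all assumptions made for illustration.

```python
def wrap_to_signed(value: int, bits: int) -> int:
    """Wrap an integer into the two's-complement range of a `bits`-wide register."""
    value &= (1 << bits) - 1          # keep only the low `bits` bits
    if value >= 1 << (bits - 1):      # reinterpret the top bit as the sign
        value -= 1 << bits
    return value


def dot_with_accumulator(weights, activations, acc_bits: int) -> int:
    """Accumulate integer products in a wraparound accumulator (assumed behavior)."""
    acc = 0
    for w, x in zip(weights, activations):
        acc = wrap_to_signed(acc + w * x, acc_bits)
    return acc


# Hypothetical 4-bit weights and activations, repeated to form a 64-element dot product.
weights = [7, -3, 5, 6, 7, 4, -2, 7] * 8
activations = [15, 2, 12, 9, 14, 11, 3, 15] * 8

exact = sum(w * x for w, x in zip(weights, activations))
print("exact sum:         ", exact)
print("32-bit accumulator:", dot_with_accumulator(weights, activations, 32))
print("12-bit accumulator:", dot_with_accumulator(weights, activations, 12))
```

In this example the exact sum lies outside the 12-bit signed range of [-2048, 2047], so the 12-bit result differs from the 32-bit one. Training or fine-tuning for low-precision accumulation has to cope with exactly this kind of error.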

Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators

We present accumulator-aware quantization (A2Q), a novel weight quantization
method designed to train quantized neural networks (QNNs) to avoid overflow
when using low-precision accumulators during inference. A2Q introduces a unique
formulation inspired by weight normalization that constrains the L1-norm of
model weights according to accumulator bit width bounds that we derive. Thus,
in training QNNs for low-precision accumulation, A2Q also inherently promotes
unstructured weight sparsity to guarantee overflow avoidance. We apply our
method to deep learning-based computer vision tasks to show that A2Q can train
QNNs for low-precision accumulators while maintaining model accuracy
competitive with a floating-point baseline. In our evaluations, we consider the
impact of A2Q on both general-purpose platforms and programmable hardware.
However, we primarily target model deployment on FPGAs because they can be
programmed to fully exploit custom accumulator bit widths. Our experimentation
shows accumulator bit width significantly impacts the resource efficiency of
FPGA-based accelerators. On average across our benchmarks, A2Q offers up to a
2.3x reduction in resource utilization over 32-bit accumulator counterparts
with 99.2% of the floating-point model accuracy.
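
The link between the L1-norm of the weights and overflow can be sketched with a generic worst-case argument. The abstract does not state the exact bound derived for A2Q, so the check below is an assumption, not the paper's formulation: since |sum(w_i * x_i)| <= ||w||_1 * max|x|, a dot product cannot overflow a signed P-bit accumulator if ||w||_1 * max|x| <= 2^(P-1) - 1. The function names are hypothetical.

```python
def max_input_magnitude(input_bits: int, signed_input: bool) -> int:
    """Largest absolute value representable by an integer input of the given width."""
    return 2 ** (input_bits - 1) if signed_input else 2 ** input_bits - 1


def is_overflow_safe(weights, acc_bits: int, input_bits: int,
                     signed_input: bool = False) -> bool:
    """Worst-case check (assumed bound): can any input overflow the accumulator?"""
    l1_norm = sum(abs(w) for w in weights)       # L1-norm of the integer weights
    max_acc = 2 ** (acc_bits - 1) - 1            # largest signed accumulator value
    return l1_norm * max_input_magnitude(input_bits, signed_input) <= max_acc


# Hypothetical integer weights with unsigned 8-bit inputs and a 16-bit accumulator.
print(is_overflow_safe([100, -28], acc_bits=16, input_bits=8))   # L1-norm 128
print(is_overflow_safe([100, -29], acc_bits=16, input_bits=8))   # L1-norm 129
```

Because the limit on the L1-norm shrinks as the accumulator narrows, a constraint of this kind pushes many weights toward zero, which is consistent with the unstructured sparsity the abstract describes.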
