Quantization techniques commonly reduce the inference costs of neural
networks by restricting the precision of weights and activations. Recent
studies show that reducing the precision of the accumulator as well can further
improve hardware efficiency, but at the risk of numerical overflow, which
introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow
while maintaining accuracy, recent work proposed accumulator-aware quantization
(A2Q), a quantization-aware training method that constrains model weights
during training to safely use a target accumulator bit width during inference.
Although A2Q shows promise, we demonstrate that it relies on an overly
restrictive constraint and a sub-optimal weight initialization strategy, each
of which introduces superfluous quantization error. To address these shortcomings,
we introduce: (1) an improved bound that alleviates accumulator constraints
without compromising overflow avoidance; and (2) a new strategy for
initializing quantized weights from pre-trained floating-point checkpoints. We
combine these contributions with weight normalization to introduce A2Q+. We
support our analysis with experiments that show A2Q+ significantly improves the
trade-off between accumulator bit width and model accuracy and characterize new
trade-offs that arise as a consequence of accumulator constraints.
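
To make the overflow risk concrete, the following is a minimal sketch of an integer dot product whose running sum is held in a narrow signed register. It assumes two's-complement wraparound on overflow (real hardware may saturate instead), and the helper names `wrap_signed` and `dot_with_accumulator`, as well as the weight and input values, are hypothetical and chosen only for illustration.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into the two's-complement range of a `bits`-wide register."""
    mask = (1 << bits) - 1
    value &= mask
    # Reinterpret the top bit as the sign bit.
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def dot_with_accumulator(weights, inputs, acc_bits: int) -> int:
    """Integer dot product whose running sum lives in an `acc_bits`-wide signed register."""
    acc = 0
    for w, x in zip(weights, inputs):
        # Assumption: overflow wraps around after every multiply-accumulate step.
        acc = wrap_signed(acc + w * x, acc_bits)
    return acc


# Hypothetical 4-bit signed weights and 4-bit unsigned activations.
weights = [7, 7, 7, 7]
inputs = [15, 15, 15, 15]

exact = sum(w * x for w, x in zip(weights, inputs))   # 420
wide = dot_with_accumulator(weights, inputs, 32)      # 420, no overflow
narrow = dot_with_accumulator(weights, inputs, 8)     # -92, wrapped past +127
```

With a 32-bit accumulator the result matches the exact sum, while the 8-bit accumulator silently returns a wrong value. This is the kind of arithmetic error that accumulator-aware training is meant to rule out.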

A2Q+: Improving Accumulator-Aware Weight Quantization

We present accumulator-aware quantization (A2Q), a novel weight quantization
method designed to train quantized neural networks (QNNs) to avoid overflow
when using low-precision accumulators during inference. A2Q introduces a unique
formulation inspired by weight normalization that constrains the L1-norm of
model weights according to accumulator bit width bounds that we derive. Thus,
in training QNNs for low-precision accumulation, A2Q also inherently promotes
unstructured weight sparsity to guarantee overflow avoidance. We apply our
method to deep learning-based computer vision tasks to show that A2Q can train
QNNs for low-precision accumulators while maintaining model accuracy
competitive with a floating-point baseline. In our evaluations, we consider the
impact of A2Q on both general-purpose platforms and programmable hardware.
However, we primarily target model deployment on FPGAs because they can be
programmed to fully exploit custom accumulator bit widths. Our experimentation
shows accumulator bit width significantly impacts the resource efficiency of
FPGA-based accelerators. On average across our benchmarks, A2Q offers up to a
2.3x reduction in resource utilization over 32-bit accumulator counterparts
with 99.2% of the floating-point model accuracy.
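
As a rough illustration of why bounding the L1-norm of the weights rules out overflow, the sketch below uses a conservative worst-case argument: a dot product can never exceed the largest input magnitude times the L1-norm of the integer weights. This is a first-principles bound under stated assumptions (signed accumulator, integer weights, N-bit integer inputs) and is not necessarily the exact bound derived in the paper; the names `l1_budget` and `fits_accumulator` are hypothetical.

```python
def l1_budget(acc_bits: int, input_bits: int, signed_input: bool) -> int:
    """Largest integer-weight L1-norm that cannot overflow in the worst case."""
    # Largest magnitude an input can take: 2^(N-1) if signed, 2^N - 1 if unsigned.
    max_abs_input = (1 << (input_bits - 1)) if signed_input else (1 << input_bits) - 1
    # Largest positive value of a signed accumulator: 2^(P-1) - 1.
    acc_max = (1 << (acc_bits - 1)) - 1
    # |sum_i x_i * q_i| <= max|x| * ||q||_1, so require max|x| * ||q||_1 <= acc_max.
    return acc_max // max_abs_input


def fits_accumulator(q_weights, acc_bits: int, input_bits: int, signed_input: bool) -> bool:
    """Check whether integer weights stay within the worst-case L1 budget."""
    return sum(abs(q) for q in q_weights) <= l1_budget(acc_bits, input_bits, signed_input)


budget_unsigned = l1_budget(16, 8, signed_input=False)   # 128
budget_signed = l1_budget(16, 8, signed_input=True)      # 255

ok = fits_accumulator([64, -64], 16, 8, signed_input=False)        # L1 = 128
too_large = fits_accumulator([100, 29], 16, 8, signed_input=False)  # L1 = 129
```

Because every nonzero integer weight contributes at least 1 to the L1-norm, a budget of 128 means at most 128 weights in a dot product can be nonzero, however long the dot product is. This is one way to see why such a constraint tends to push weights toward unstructured sparsity as the accumulator narrows.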
