Pretraining transformers is generally time-consuming. Fully quantized
training (FQT) is a promising approach to speed up pretraining. However, most
FQT methods adopt a quantize-compute-dequantize procedure, which often leads to
suboptimal speedup and significant performance degradation when used in
transformers due to the high memory access overheads and low-precision
computations. In this work, we propose Jetfire, an efficient and accurate INT8
training method specific to transformers. Our method features an INT8 data flow
to optimize memory access and a per-block quantization method to maintain the
accuracy of pretrained transformers. Extensive experiments demonstrate that our
INT8 FQT method achieves comparable accuracy to the FP16 training baseline and
outperforms the existing INT8 training works for transformers. Moreover, for a
standard transformer block, our method offers an end-to-end training speedup of
1.42x and a 1.49x memory reduction compared to the FP16 baseline.
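
To make the idea of per-block quantization concrete, here is a minimal NumPy sketch. It is not the Jetfire implementation: the function names, the 32x32 tile size, and the symmetric max-abs scaling rule are assumptions chosen for illustration. The point it shows is that each tile gets its own scale, so a single outlier only degrades the resolution of the tile it sits in rather than the whole tensor.

```python
import numpy as np


def quantize_per_block(x, block=32):
    """Quantize a 2-D float array to INT8 with one scale per (block x block) tile.

    Hypothetical helper for illustration; the tile size is an assumption.
    """
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0, "shape must be a multiple of block"
    # View the matrix as a grid of tiles: (rows/block, block, cols/block, block).
    tiles = x.reshape(rows // block, block, cols // block, block)
    # One symmetric scale per tile, taken from the tile's largest magnitude.
    absmax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0)
    # Round to nearest and store as INT8.
    q = np.clip(np.rint(tiles / scale), -127, 127).astype(np.int8)
    return q.reshape(rows, cols), scale.reshape(rows // block, cols // block)


def dequantize_per_block(q, scale, block=32):
    """Map INT8 codes back to floats using the per-tile scales."""
    rows, cols = q.shape
    tiles = q.reshape(rows // block, block, cols // block, block).astype(np.float32)
    # Broadcast each tile's scale over the elements of that tile.
    return (tiles * scale[:, None, :, None]).reshape(rows, cols)
```

Calling `quantize_per_block` on a matrix that contains one large outlier and comparing the reconstruction error with a single-scale (per-tensor) variant is a quick way to see why finer-grained scales help accuracy.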

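Jetfire is an efficient and accurate INT8 pretraining method for transformers. It uses an INT8 data flow to optimize memory access and per-block quantization to reach accuracy comparable to the FP16 baseline, while delivering, for a standard transformer block, a 1.42x training speedup and a 1.49x memory reduction relative to that baseline.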

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

Fully quantized training (FQT), which uses low-bitwidth hardware by
quantizing the activations, weights, and gradients of a neural network model,
is a promising approach to accelerate the training of deep neural networks. One
major challenge with FQT is the lack of theoretical understanding, in
particular of how gradient quantization impacts convergence properties. In this
paper, we address this problem by presenting a statistical framework for
analyzing FQT algorithms. We view the quantized gradient of FQT as a stochastic
estimator of its full precision counterpart, a procedure known as
quantization-aware training (QAT). We show that the FQT gradient is an unbiased
estimator of the QAT gradient, and we discuss the impact of gradient
quantization on its variance. Inspired by these theoretical results, we develop
two novel gradient quantizers, and we show that these have smaller variance
than the existing per-tensor quantizer. For training ResNet-50 on ImageNet, our
5-bit block Householder quantizer achieves only 0.5% validation accuracy loss
relative to QAT, comparable to the existing INT8 baseline.
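
The unbiasedness claim can be illustrated with a small sketch of a per-tensor quantizer. The abstract does not spell out the quantizer's construction, so the use of stochastic rounding here is an assumption (it is a common way to obtain an unbiased estimator), and the function name and parameters are hypothetical. Rounding up with probability equal to the fractional part makes the expected output equal the input, while the variance depends on the step size.

```python
import numpy as np


def stochastic_quantize_per_tensor(g, bits=8, rng=None):
    """Quantize `g` onto a uniform grid with one scale for the whole tensor.

    Returns the dequantized values. Hypothetical helper for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** bits - 1
    lo, hi = g.min(), g.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Position of each entry on the integer grid [0, levels].
    t = np.clip((g - lo) / scale, 0, levels)
    floor = np.floor(t)
    # Round up with probability equal to the fractional part, so E[codes] = t.
    round_up = rng.random(g.shape) < (t - floor)
    codes = floor + round_up
    return codes * scale + lo
```

Averaging many independent quantizations of the same tensor should converge to the original values, and a wider value range (larger `scale`) should visibly increase the spread of individual samples, which is the variance effect the abstract refers to.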

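This paper presents a statistical framework for analyzing fully quantized training algorithms and examines how gradient quantization affects their convergence. The authors develop two new gradient quantizers and show that they have smaller variance than the existing per-tensor quantizer.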