We investigate the effects of post-training quantization and
quantization-aware training on the generalization of Transformer language
models. We present a new method called self-distilled quantization (SDQ) that
minimizes accumulative quantization errors and outperforms baselines. We apply
SDQ to multilingual models XLM-R-Base and InfoXLM-Base and demonstrate that
both models can be reduced from 32-bit floating point weights to 8-bit integer
weights while maintaining a high level of performance on the XGLUE benchmark.
Our results also highlight the challenges of quantizing multilingual models,
which must generalize to languages they were not fine-tuned on.
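
To make the 32-bit to 8-bit reduction concrete, the following is a minimal sketch of symmetric per-tensor weight quantization in plain Python. It is not the SDQ method itself, only an illustration of the rounding step whose errors SDQ is meant to control; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and chosen purely for illustration.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] with one shared scale.

    Assumption: symmetric per-tensor quantization, a common baseline scheme;
    the source does not state which scheme it uses.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights standing in for one row of a weight matrix.
weights = [0.5, -1.27, 0.003, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Per-weight rounding error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight incurs an error of at most half the step size `scale`. In a deep Transformer these per-layer errors compound through successive layers, which is the accumulation the abstract refers to.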
