Deep Neural Networks (DNNs) have many parameters and much activation data, and
both are expensive to implement. One method to reduce the size of a DNN is to
quantize the pre-trained model by using a low-bit representation for weights
and activations, using fine-tuning to recover the drop in accuracy. However, it
is generally difficult to train neural networks that use low-bit representations.
One reason is that the weights in the middle layers of the DNN have a wide
dynamic range, so when this range is quantized into a few bits, the step size
becomes large, which leads to a large quantization error and ultimately a
large degradation in accuracy. To solve this problem, this paper
makes the following three contributions without introducing any additional
learnable parameters or hyper-parameters. First, we analyze how batch
normalization,
which causes the aforementioned problem, disturbs the fine-tuning of the
quantized DNN. Second, based on these results, we propose a new pruning method
called Pruning for Quantization (PfQ), which removes the filters that disturb
the fine-tuning of the DNN while affecting the inference result as little as
possible. Third, we propose a fine-tuning workflow for quantized DNNs that uses
PfQ. Experiments using well-known models and datasets confirmed that the
proposed method achieves higher performance at a similar model size than
conventional quantization methods that include fine-tuning.
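
The link between dynamic range and step size described above can be
illustrated with a small sketch. It is not the paper's quantizer: it assumes a
generic symmetric uniform quantizer, and the weight values, the 4-bit width,
and the helper names (quantize_symmetric, mean_squared_error) are hypothetical
choices made only for illustration.

```python
import random


def quantize_symmetric(values, num_bits):
    """Generic symmetric uniform quantizer (an assumption, not the paper's).

    Returns the dequantized values and the step size that was used.
    """
    max_abs = max(abs(v) for v in values)
    levels = 2 ** (num_bits - 1) - 1  # number of positive levels, e.g. 7 for 4 bits
    step = max_abs / levels           # a wider range gives a larger step
    dequantized = [round(v / step) * step for v in values]
    return dequantized, step


def mean_squared_error(a, b):
    """Mean squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)

# Hypothetical weights with a narrow dynamic range.
narrow = [random.uniform(-0.1, 0.1) for _ in range(1000)]

# The same weights plus one large value that widens the dynamic range.
wide = narrow + [8.0]

q_narrow, step_narrow = quantize_symmetric(narrow, num_bits=4)
q_wide, step_wide = quantize_symmetric(wide, num_bits=4)

# Measure the error on the same 1000 small weights in both cases.
mse_narrow = mean_squared_error(narrow, q_narrow)
mse_wide = mean_squared_error(narrow, q_wide[:1000])

print(f"narrow range: step = {step_narrow:.5f}, MSE = {mse_narrow:.2e}")
print(f"wide range:   step = {step_wide:.5f}, MSE = {mse_wide:.2e}")
```

In this toy setting the single large weight sets the step size for the whole
tensor, so every small weight is rounded to zero and the error on those
weights grows accordingly.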

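This work proposes a method for reducing the size of deep neural networks (DNNs) by quantizing the weights and activation data of a pre-trained model with a low-bit representation. It also proposes PfQ, a new pruning-based method, to address the quantization error and accuracy drop caused by the wide dynamic range of the weights in the middle layers.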