Knowledge distillation (KD), transferring knowledge from a cumbersome teacher model to a lightweight student model, has been investigated to design efficient neural architectures. Generally, the objective function of KD is the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher model and the student model with the temperature scaling hyperparameter tau. Despite its widespread use, few studies have discussed the influence of such softening on generalization. Here, we theoretically show that the KL divergence loss focuses on the logit matching when tau increases and the label matching when tau goes to 0 and empirically show that the logit matching is positively correlated to performance improvement in general. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the logit of the teacher model. The MSE loss outperforms the KL divergence loss, explained by the difference in the penultimate layer representations between the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small tau, mitigates the label noise. The code to reproduce the experiments is publicly available online at https://github.com/jhoon-oh/kd_data/.

研究知识蒸馏的目标函数KL散度损失在温度参数变大时侧重于logit匹配，而在温度参数趋近于0时侧重于标签匹配，提出使用均方误差作为损失函数，学生模型直接学习老师模型的logit向量。该方法优于KL散度损失，并可以改善标签噪声，通过实验证明了知识蒸馏的有效性。

知识蒸馏中的Kullback-Leibler Divergence和Mean Squared Error Loss的比较