Deep learning-based models have recently been widely studied for click-through rate (CTR) prediction and have led to improved prediction accuracy in many industrial applications. However, current research focuses primarily on building complex network architectures to better capture sophisticated feature interactions and dynamic user behaviors. The increased model complexity may slow down online inference and hinder adoption in real-time applications. Our work instead targets a new model training strategy based on knowledge distillation (KD). KD is a teacher-student learning framework that transfers knowledge learned by a teacher model to a student model. The KD strategy not only allows us to simplify the student model to a vanilla DNN but also achieves significant accuracy improvements over state-of-the-art teacher models. These benefits motivate us to further explore a powerful ensemble of teachers for more accurate student model training. We also propose novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss. We conduct comprehensive experiments against 12 existing models across three industrial datasets. Both offline and online A/B testing results show the effectiveness of our KD-based training strategy.

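This paper proposes a model training strategy based on knowledge distillation (KD). By transferring the knowledge learned by a teacher model to a student model, it simplifies the student to a plain deep neural network (DNN) while achieving a significant accuracy improvement, and training with multiple teacher models further improves the student's accuracy. Experiments, together with novel techniques such as teacher gating and early stopping by distillation loss, demonstrate the effectiveness of the KD-based training strategy.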
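
To make the training strategy concrete, here is a minimal NumPy sketch of one way a student could be trained against a gated ensemble of teachers. The abstract does not give the loss formulas, so everything here is an assumption for illustration: the names `gated_teacher_prob` and `distillation_loss`, the softmax form of the gate, the cross-entropy soft-label term, and the weight `gamma` are all hypothetical and may differ from the paper's actual formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bce(target, pred, eps=1e-7):
    # Binary cross-entropy; `target` may be a hard label or a soft probability.
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))

def gated_teacher_prob(teacher_probs, gate_scores):
    # Assumed gating form: per-sample softmax weights over the teachers.
    # teacher_probs, gate_scores: arrays of shape (batch, n_teachers).
    weights = softmax(gate_scores, axis=1)
    return (weights * teacher_probs).sum(axis=1)

def distillation_loss(labels, student_logits, teacher_probs, gate_scores, gamma=0.5):
    # Hard-label loss plus a gamma-weighted soft-label (distillation) loss.
    student_probs = sigmoid(student_logits)
    soft_targets = gated_teacher_prob(teacher_probs, gate_scores)
    hard = bce(labels, student_probs)
    soft = bce(soft_targets, student_probs)
    return hard + gamma * soft, soft

# Toy batch: two samples, two teachers, uniform gate scores.
labels = np.array([0.0, 1.0])
student_logits = np.array([-1.0, 2.0])
teacher_probs = np.array([[0.2, 0.4], [0.9, 0.7]])
gate_scores = np.zeros((2, 2))
total, soft = distillation_loss(labels, student_logits, teacher_probs, gate_scores)
```

In this sketch the second return value is the distillation term on its own. One plausible reading of "early stopping by distillation loss" is to monitor that term during training and stop when it no longer improves, though the abstract does not specify the exact criterion.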

Ensembled CTR Prediction via Knowledge Distillation