Enhancing small language models for real-world deployment is a significant challenge for the research community. Because large language models are difficult and costly to use, researchers are seeking ways to deploy task-specific small models effectively. In this work, we introduce a simple yet effective knowledge distillation method to improve the performance of small language models. Our approach uses a teacher model with approximately 3 billion parameters to identify the tokens that most influence its decisions. These tokens are extracted from the input according to their attribution scores with respect to the output, computed with methods such as saliency maps. The important tokens are then provided as rationales to a student model, with the aim of distilling the teacher's knowledge. We evaluate the method on four diverse datasets, where it improves over both standard fine-tuning and state-of-the-art knowledge distillation models. Furthermore, we explore why the method succeeds by analyzing the important tokens extracted from the teacher model. Our findings reveal that in 68\% of cases, specifically in datasets where labels are part of the answer, such as multiple-choice questions, the extracted tokens are part of the ground truth.
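The token-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: a toy linear scorer stands in for the 3-billion-parameter teacher, gradient-times-input is used as one possible saliency method, and all function names, the embedding values, and the record layout are hypothetical. How the rationale is actually fed to the student is not specified in the abstract, so the sketch only stores it alongside the input and label.

```python
from typing import Dict, List, Sequence


def gradient_x_input(embeddings: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Saliency per token for a toy linear scorer (assumption, not the real teacher).

    The logit is sum_i (w . e_i), so its gradient with respect to each
    token embedding e_i is w, and gradient-times-input is w . e_i.
    The absolute value is used as the attribution score.
    """
    return [abs(sum(w * x for w, x in zip(weights, emb))) for emb in embeddings]


def top_k_tokens(tokens: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    """Return the k highest-scoring tokens, kept in their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]


def build_student_record(text: str, label: str, rationale: Sequence[str]) -> Dict[str, str]:
    """Bundle input, label, and rationale tokens into one training record (hypothetical layout)."""
    return {"input": text, "label": label, "rationale": " ".join(rationale)}


# Toy example with made-up 2-dimensional embeddings and weights.
tokens = ["the", "capital", "of", "France", "is", "?"]
embeddings = [[0.1, 0.0], [0.9, 0.4], [0.0, 0.1], [1.0, 0.8], [0.1, 0.1], [0.0, 0.0]]
weights = [1.0, 0.5]

scores = gradient_x_input(embeddings, weights)
rationale = top_k_tokens(tokens, scores, k=2)
record = build_student_record(" ".join(tokens), "Paris", rationale)
print(record["rationale"])
```

With a real teacher, the scores would come from an attribution method applied to the model's output rather than from a closed-form dot product; the selection and packaging steps would remain the same in spirit.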

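This work addresses the problem of improving the performance of small language models in real-world applications. We propose a simple and effective knowledge distillation method that analyzes the teacher model's important tokens to help the student model learn better, which significantly improves the performance of small models. In particular, on datasets that contain the labels, such as multiple-choice questions, the extracted tokens are part of the answer in 68% of cases.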

Efficient Knowledge Distillation: Enhancing Small Language Models with Teacher Model Insights