Oct, 2022
Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen...
TL;DR
This work proposes TED, a task-aware layer-wise distillation method. TED uses task-aware filters to select the knowledge that is useful for the target task, which narrows the knowledge gap between the student and the teacher and helps the student adapt better to that task. In both continual pretraining and fine-tuning settings, TED achieves clear and consistent improvements over existing distillation methods.
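To make the idea concrete, below is a minimal sketch of task-aware layer-wise distillation, assuming PyTorch. The names (TaskFilter, ted_layer_loss), the one-to-one layer mapping, and the hidden sizes are illustrative assumptions, not the authors' released implementation: each layer's hidden states from the teacher and the student are passed through a small task-aware filter, and the distillation loss aligns the filtered representations rather than the raw hidden states.

```python
# Sketch of a task-aware layer-wise distillation loss (illustrative, not the
# official TED code). Filters project each layer's hidden states into a
# task-aware subspace; the loss matches filtered student/teacher outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskFilter(nn.Module):
    """Projects a layer's hidden states into a task-aware subspace."""
    def __init__(self, hidden_size: int, filter_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, filter_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

def ted_layer_loss(student_hiddens, teacher_hiddens,
                   student_filters, teacher_filters):
    """Mean MSE between filtered student and teacher representations,
    computed layer by layer (student layers mapped 1:1 to selected
    teacher layers here for simplicity)."""
    loss = 0.0
    for s_h, t_h, s_f, t_f in zip(student_hiddens, teacher_hiddens,
                                  student_filters, teacher_filters):
        # Teacher side is detached: only the student (and filters) get gradients.
        loss = loss + F.mse_loss(s_f(s_h), t_f(t_h).detach())
    return loss / len(student_filters)

# Toy usage: 4 student layers aligned with 4 selected teacher layers.
if __name__ == "__main__":
    batch, seq, d_teacher, d_student, d_filter = 2, 8, 768, 384, 256
    teacher_hiddens = [torch.randn(batch, seq, d_teacher) for _ in range(4)]
    student_hiddens = [torch.randn(batch, seq, d_student, requires_grad=True)
                       for _ in range(4)]
    teacher_filters = nn.ModuleList([TaskFilter(d_teacher, d_filter)
                                     for _ in range(4)])
    student_filters = nn.ModuleList([TaskFilter(d_student, d_filter)
                                     for _ in range(4)])
    loss = ted_layer_loss(student_hiddens, teacher_hiddens,
                          student_filters, teacher_filters)
    loss.backward()
    print(loss.item())
```

In practice the filters would first be trained on the target task (e.g., with a small task head on top) so that they keep only task-relevant information before the distillation step; the sketch above only shows the alignment loss itself.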
Abstract
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into small ones (i.e., student models). The student …