Oct, 2022
Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen...
TL;DR
This work proposes TED, a task-aware layer-wise distillation method. TED uses task-aware filters to select the knowledge that is useful for the target task, which narrows the knowledge gap between the student and the teacher and helps the student adapt better to that task. In both continual pretraining and fine-tuning settings, TED achieves clear and consistent improvements over existing distillation methods.
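To make the idea concrete, below is a minimal sketch of task-aware layer-wise distillation, assuming PyTorch. The names (TaskFilter, ted_layer_loss), the one-to-one layer mapping, and the hidden sizes are illustrative assumptions, not the authors' released implementation: each layer's hidden states from the teacher and the student are passed through a small task-aware filter, and the distillation loss aligns the filtered representations rather than the raw hidden states.

```python
# Sketch of a task-aware layer-wise distillation loss (illustrative, not the
# official TED code). Filters project each layer's hidden states into a
# task-aware subspace; the loss matches filtered student/teacher outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskFilter(nn.Module):
    """Projects a layer's hidden states into a task-aware subspace."""
    def __init__(self, hidden_size: int, filter_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, filter_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

def ted_layer_loss(student_hiddens, teacher_hiddens,
                   student_filters, teacher_filters):
    """Mean MSE between filtered student and teacher representations,
    computed layer by layer (student layers mapped 1:1 to selected
    teacher layers here for simplicity)."""
    loss = 0.0
    for s_h, t_h, s_f, t_f in zip(student_hiddens, teacher_hiddens,
                                  student_filters, teacher_filters):
        # Teacher side is detached: only the student (and filters) get gradients.
        loss = loss + F.mse_loss(s_f(s_h), t_f(t_h).detach())
    return loss / len(student_filters)

# Toy usage: 4 student layers aligned with 4 selected teacher layers.
if __name__ == "__main__":
    batch, seq, d_teacher, d_student, d_filter = 2, 8, 768, 384, 256
    teacher_hiddens = [torch.randn(batch, seq, d_teacher) for _ in range(4)]
    student_hiddens = [torch.randn(batch, seq, d_student, requires_grad=True)
                       for _ in range(4)]
    teacher_filters = nn.ModuleList([TaskFilter(d_teacher, d_filter)
                                     for _ in range(4)])
    student_filters = nn.ModuleList([TaskFilter(d_student, d_filter)
                                     for _ in range(4)])
    loss = ted_layer_loss(student_hiddens, teacher_hiddens,
                          student_filters, teacher_filters)
    loss.backward()
    print(loss.item())
```

In practice the filters would first be trained on the target task (e.g., with a small task head on top) so that they keep only task-relevant information before the distillation step; the sketch above only shows the alignment loss itself.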
Abstract
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into small ones (i.e., student models). The student …