February 2024
Revisiting Knowledge Distillation for Autoregressive Language Models
Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du...
TL;DR
Analysis shows that having larger language models teach a student model can actually degrade performance. The authors design an adaptive teaching approach (ATKD) to improve knowledge distillation, and extensive experiments verify that it significantly improves performance across various model types and scales (average score gains of up to +3.04%). More importantly, ATKD effectively improves the generalization ability of student models.
Abstract
Knowledge distillation (KD) is a common approach to compress a teacher model, reducing its inference cost and memory footprint, by training a smaller student model. However, in the context of autoregressive language models …
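For context, below is a minimal sketch of the standard token-level KD objective for autoregressive LMs that the abstract refers to; it is not the paper's ATKD method, and the function name, temperature value, and tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits, teacher_logits, temperature=2.0):
    # student_logits, teacher_logits: [batch, seq_len, vocab_size]
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student) summed over token positions, averaged over the batch,
    # scaled by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

In practice this distillation term is combined with the usual next-token cross-entropy loss on the ground-truth labels; ATKD modifies how the teacher's signal is used, which the truncated abstract does not detail here.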