Feb, 2024
Sinkhorn Distance Minimization for Knowledge Distillation
Xiao Cui, Yulei Qin, Yuting Gao, Enwei Zhang, Zihan Xu...
TL;DR
The proposed Sinkhorn knowledge distillation method overcomes the mode averaging of the Kullback-Leibler divergence, the mode collapse of the reverse Kullback-Leibler divergence, and the mode underestimation of the Jensen-Shannon divergence in the conventional teacher-student paradigm, effectively compressing large language models and achieving superior performance across diverse natural language processing tasks.
Abstract
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence…
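
The excerpt above does not show the paper's exact objective, but the core idea of using an entropy-regularised optimal-transport (Sinkhorn) distance between the student and teacher output distributions as the distillation loss can be sketched minimally as follows. The ground cost over the vocabulary, the regularisation strength `eps`, and the iteration count are illustrative assumptions, not the authors' settings.

```python
import torch

def sinkhorn_distance(a, b, cost, eps=0.1, n_iters=50):
    """Entropy-regularised optimal-transport cost <P, C> between two
    discrete distributions a (student) and b (teacher).

    a, b : (V,) probability vectors over the vocabulary.
    cost : (V, V) ground-cost matrix between vocabulary entries (assumed given).
    """
    log_a, log_b = torch.log(a + 1e-9), torch.log(b + 1e-9)
    f = torch.zeros_like(a)  # log-domain dual potentials
    g = torch.zeros_like(b)
    for _ in range(n_iters):
        # Alternately enforce the row (student) and column (teacher) marginals.
        f = eps * (log_a - torch.logsumexp((g.unsqueeze(0) - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f.unsqueeze(1) - cost) / eps, dim=0))
    # Recover the transport plan and return its total transport cost.
    plan = torch.exp((f.unsqueeze(1) + g.unsqueeze(0) - cost) / eps)
    return (plan * cost).sum()

# Toy usage: the loss is differentiable w.r.t. the student logits.
torch.manual_seed(0)
V = 8
student_logits = torch.randn(V, requires_grad=True)
teacher_probs = torch.softmax(torch.randn(V), dim=0)
# Illustrative ground cost: absolute distance between token indices.
cost = (torch.arange(V)[:, None] - torch.arange(V)[None, :]).abs().float()
loss = sinkhorn_distance(torch.softmax(student_logits, dim=0), teacher_probs, cost)
loss.backward()  # gradients flow back to the student logits
```

Unlike the pointwise KL-family divergences, this loss depends on a ground cost between vocabulary entries, which is what lets it provide useful gradients even when the student and teacher distributions barely overlap.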