Knowledge distillation provides an effective method for deploying complex machine learning models in resource-constrained environments. It typically involves training a smaller student model to emulate either the probabilistic outputs or the internal feature representations of a larger teacher model. By doing so, the student model often achieves substantially better performance on a downstream task compared to when it is trained independently. Nevertheless, the teacher's internal representations can also encode noise or additional information that may not be relevant to the downstream task. This observation motivates our primary question: What are the information-theoretic limits of knowledge transfer? To this end, we leverage a body of work in information theory called Partial Information Decomposition (PID) to quantify the distillable and distilled knowledge of a teacher's representation corresponding to a given student and a downstream task. Moreover, we demonstrate that this metric can be practically used in distillation to address challenges caused by the complexity gap between the teacher and the student representations.

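This work addresses the theoretical limits of information transfer in knowledge distillation. By introducing Partial Information Decomposition, it quantifies how much knowledge in a teacher model's representation is distillable and how much has already been distilled, and shows that this metric can be applied during distillation to mitigate the complexity gap between teacher and student representations. This offers a new perspective and approach for deploying machine learning models in resource-constrained environments.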

Quantifying Knowledge Distillation Using Partial Information Decomposition