Knowledge distillation (KD) aims to transfer knowledge in a teacher-student
framework: during training, the predictions of the teacher network are provided
to the student network to help it generalize better. It
can use either a teacher with high capacity or an