Knowledge distillation (KD) has proved to be an effective approach to deep
neural network compression, in which a compact network (the student) is
trained by transferring knowledge from a pre-trained, over-parameterized
network (the teacher). In traditional KD, the transferred knowledge is usually