Knowledge distillation transfers knowledge from the teacher network to the
student one, with the goal of greatly improving the performance of the student
network. Previous methods mostly focus on proposing feature transformations and
loss functions between features at the same level to i