The primary goal of knowledge distillation (KD) is to transfer the knowledge
learned by a teacher network into a student network, the latter being more
compact than the former. Existing work, e.g., distillation with the
Kullback-Leibler (KL) divergence, may fail to c
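For concreteness, the KL-divergence distillation referenced above typically follows Hinton et al. (2015): both teacher and student logits are softened by a temperature before the divergence is computed. The sketch below is a minimal, illustrative PyTorch implementation of that standard loss, not the specific method of this paper; the function name `kd_loss` and the default temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 4.0) -> torch.Tensor:
    """Standard KL-divergence distillation loss (Hinton et al., 2015).

    Both distributions are softened by `temperature`; the T^2 factor
    keeps gradient magnitudes comparable across temperature settings.
    Illustrative sketch only; hyperparameters are assumed.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Usage example on one hypothetical batch of 10-class logits.
student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()
```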