knowledge distillation (KD) aims to learn a compact student network using
knowledge from a large pre-trained teacher network, where both networks are
trained on data from the same distribution. However, in practical applications,
the student network may be required to perform in a new