In this paper, we present a thorough evaluation of the efficacy of knowledge
distillation and its dependence on student and teacher architectures. Starting
with the observation that more accurate teachers often do not make good
teachers, we attempt to tease apart the factors that affect knowledge
distillation performance. Crucially, we find that larger models