Knowledge distillation, a widely used model compression technique, works on
the basis of transferring knowledge from a cumbersome teacher model to a
lightweight student model. The technique jointly optimizes the task-specific
and knowledge distillation losses, each with an assigned weight. Although these
weights play a crucial role in distillation performance, current methods assign
equal weight to both losses, leading to suboptimal performance. In this paper,
we propose Adaptive Knowledge
Distillation, a novel technique inspired by curriculum learning to adaptively
weight the losses at the instance level. The technique rests on the premise that
sample difficulty increases with teacher loss. Our method follows a
plug-and-play paradigm that can be applied on top of any task-specific and
distillation objectives. Experiments show that our method performs better than
conventional knowledge distillation and existing instance-level loss
functions.
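
The instance-level weighting described in this abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything specific here is an assumption made for illustration: the exponential mapping from teacher loss to weight, the `temperature` parameter, the choice to reduce the distillation weight on samples the teacher finds hard, and the function names `adaptive_weights` and `adaptive_kd_loss`.

```python
import math


def adaptive_weights(teacher_losses, temperature=1.0):
    """Map per-instance teacher losses to distillation weights in (0, 1].

    Assumption: a higher teacher loss marks a harder sample, and the
    distillation weight shrinks as the teacher loss grows.
    """
    return [math.exp(-loss / temperature) for loss in teacher_losses]


def adaptive_kd_loss(task_losses, kd_losses, teacher_losses, temperature=1.0):
    """Combine task and distillation losses with one weight per instance."""
    alphas = adaptive_weights(teacher_losses, temperature)
    per_instance = [
        (1.0 - alpha) * task + alpha * kd
        for alpha, task, kd in zip(alphas, task_losses, kd_losses)
    ]
    # Average over the batch, as a fixed-weight objective would.
    return sum(per_instance) / len(per_instance)
```

Because the weights are computed only from the teacher loss, the same wrapper can sit on top of any pair of task and distillation losses, which is one way to read the plug-and-play claim.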

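This paper proposes an adaptive knowledge distillation technique, inspired by curriculum learning, that weights the losses adaptively at the instance level; experiments show that it outperforms conventional knowledge distillation and existing instance-level loss functions.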

AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting

Federated learning enables multiple clients to collaboratively learn a global
model by periodically aggregating the clients' models without transferring the
local data. However, due to the heterogeneity of the system and data, many
approaches suffer from the "client-drift" issue that could significantly slow
down the convergence of the global model training. As clients perform local
updates on heterogeneous data through heterogeneous systems, their local models
drift apart. To tackle this issue, one intuitive idea is to guide the local
model training by the global teachers, i.e., past global models, where each
client learns the global knowledge from past global models via adaptive
knowledge distillation techniques. Building on these insights, we propose a
novel approach for heterogeneous federated learning, namely FedGKD, which fuses
the knowledge from historical global models for local training to alleviate the
"client-drift" issue. In this paper, we evaluate FedGKD with extensive
experiments on various CV/NLP datasets (i.e., CIFAR-10/100, Tiny-ImageNet, AG
News, SST5) and different heterogeneous settings. The proposed method is
guaranteed to converge under common assumptions, and achieves superior
empirical accuracy in fewer communication rounds than five state-of-the-art
methods.
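
The idea of guiding local training with past global models can be sketched as follows. The abstract does not specify how the historical models are fused or how the guidance enters the local objective, so this sketch assumes elementwise parameter averaging over a fixed-size buffer and a KL-divergence penalty scaled by a hypothetical coefficient `gamma`. All function names are illustrative and are not taken from the paper.

```python
import math
from collections import deque


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fuse_global_models(history):
    """Assumption: fuse buffered global models by averaging their parameters."""
    count = len(history)
    return [sum(values) / count for values in zip(*history)]


def local_objective(task_loss, teacher_logits, student_logits, gamma=0.2):
    """Task loss plus a distillation penalty toward the fused global teacher."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return task_loss + gamma * kl_divergence(teacher, student)


# Keep only the three most recent global models (buffer size is an assumption).
history = deque(maxlen=3)
for params in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]):
    history.append(params)
fused = fuse_global_models(history)  # average of the last three entries
```

The penalty is zero when the local model already agrees with the fused teacher and grows as the local model drifts away from it, which is the behavior the client-drift argument relies on.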

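This paper proposes FedGKD, a new method that fuses knowledge from historical global models into local training to address client drift in heterogeneous federated learning. Extensive experiments on a range of computer vision and natural language processing datasets show that it outperforms five other methods.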