Oct 2023
Heterogeneous Federated Learning Using Knowledge Codistillation
Jared Lichtarge, Ehsan Amid, Shankar Kumar, Tien-Ju Yang, Rohan Anil...
TL;DR
Using a bidirectional knowledge distillation method, a larger model is trained on a subset of higher-capability clients and a smaller model on the full client pool; the two models exchange knowledge, enabling domain transfer between them and improving on federated averaging. The method shows gains on image classification and language modeling tasks, even when only out-of-domain or limited in-domain distillation data is available.
Abstract
Federated averaging, and many federated learning algorithm variants which build upon it, have a limitation: all clients must share the same model architecture. This results in unused modeling capacity on many clients…
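As described in the TL;DR, each model is trained with federated averaging on its own client pool, and the two models periodically exchange knowledge through distillation on a server-side unlabeled dataset without sharing parameters. Below is a minimal PyTorch sketch of that loop; the toy data, model sizes, helper names (`fedavg_round`, `codistill`), and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch of heterogeneous federated learning with knowledge codistillation.
# All names, sizes, and the random toy data are assumptions for illustration.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fedavg_round(global_model, client_datasets, lr=0.1, local_steps=5):
    """One federated averaging round: local SGD on each client,
    then parameter averaging on the server."""
    client_states = []
    for x, y in client_datasets:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(local_steps):
            opt.zero_grad()
            F.cross_entropy(local(x), y).backward()
            opt.step()
        client_states.append(local.state_dict())
    avg = {k: torch.stack([s[k] for s in client_states]).mean(0)
           for k in client_states[0]}
    global_model.load_state_dict(avg)

def codistill(model_a, model_b, unlabeled_x, lr=0.01, temperature=2.0):
    """Bidirectional distillation on server-side unlabeled data:
    each model is pulled toward the other's softened predictions."""
    with torch.no_grad():  # snapshot both teachers' targets before updating
        targets = {id(m): F.softmax(m(unlabeled_x) / temperature, dim=-1)
                   for m in (model_a, model_b)}
    for student, teacher in ((model_a, model_b), (model_b, model_a)):
        opt = torch.optim.SGD(student.parameters(), lr=lr)
        opt.zero_grad()
        log_p = F.log_softmax(student(unlabeled_x) / temperature, dim=-1)
        F.kl_div(log_p, targets[id(teacher)], reduction="batchmean").backward()
        opt.step()

# Toy 10-class classification setup on random features (assumption).
def make_client(n=32, d=16):
    return (torch.randn(n, d), torch.randint(0, 10, (n,)))

small = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
large = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 10))

full_pool = [make_client() for _ in range(8)]
capable_subset = full_pool[:3]          # clients able to train the larger model
server_unlabeled = torch.randn(64, 16)  # unlabeled distillation set on server

for _ in range(5):
    fedavg_round(small, full_pool)              # small model sees every client
    fedavg_round(large, capable_subset)         # large model sees the capable subset
    codistill(small, large, server_unlabeled)   # exchange knowledge, no parameters shared
```

Because the exchange happens through predictions on shared unlabeled data rather than through parameters, the two architectures never need to match, which is what lets the larger model exploit the extra capacity of the more capable clients.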