The mixture-of-experts (MoE) approach is gaining popularity due to its
ability to improve model performance. In an MoE structure, the gate layer
plays a significant role in distinguishing input features and routing them to
different experts, which enables each expert to specialize in its
corresponding sub-task. However, the gate's routing mechanism also gives rise
to narrow vision: each individual expert fails to use more samples when
learning its allocated sub-task, which in turn limits further improvement of
the MoE's generalization ability. To address this, we propose a method called
Mixture-of-Distilled-Expert (MoDE), which applies moderate mutual
distillation among experts so that each expert picks up more features
learned by the other experts and gains a more accurate perception of its
originally allocated sub-task. We conduct extensive experiments on tabular,
NLP and CV datasets, which show MoDE's effectiveness, universality and
robustness. Furthermore, we conduct a parallel study by constructing
"expert probing" to show experimentally why MoDE works: moderate knowledge
distillation improves each individual expert's test performance on its
assigned task, leading to an overall performance improvement of the MoE.
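
The idea of adding a mutual distillation term to an MoE objective can be
illustrated with a minimal sketch. Everything below is an assumption made for
illustration and is not taken from the paper: experts emit scalar predictions,
the task loss is squared error, the distillation term is the mean squared
difference between every pair of expert outputs, and the names `moe_output`,
`mutual_distillation_loss`, `mode_loss` and the weight `alpha` are
hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_output(gate_logits, expert_outputs):
    """Gate-weighted sum of scalar expert predictions."""
    weights = softmax(gate_logits)
    return sum(w * y for w, y in zip(weights, expert_outputs))


def mutual_distillation_loss(expert_outputs):
    """Mean squared difference over all unordered pairs of experts.

    Assumption: this pairwise L2 form is only one possible choice of
    distillation term; it is used here because it is easy to verify by hand.
    """
    k = len(expert_outputs)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if not pairs:
        return 0.0
    total = sum((expert_outputs[i] - expert_outputs[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


def mode_loss(gate_logits, expert_outputs, target, alpha=0.1):
    """Squared-error task loss plus alpha times the distillation term.

    alpha is a hypothetical knob for the "moderate" distillation strength.
    """
    task = (moe_output(gate_logits, expert_outputs) - target) ** 2
    return task + alpha * mutual_distillation_loss(expert_outputs)


# Two experts that disagree; the gate weights them equally.
loss = mode_loss([0.0, 0.0], [1.0, 3.0], target=2.0, alpha=0.5)
print(loss)
```

With `alpha` set to zero the objective reduces to the plain MoE task loss, so
the weight controls how strongly the experts are pulled toward each other.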

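We propose a method called Mixture-of-Distilled-Expert (MoDE), which applies moderate mutual distillation among experts so that each expert picks up more of the features learned by other experts and gains a more accurate perception of its originally allocated sub-task. We conduct extensive experiments on tabular, natural language processing and computer vision datasets, which demonstrate MoDE's effectiveness, universality and robustness. In addition, we carry out a parallel study by constructing "expert probing", which shows experimentally why MoDE works: moderate knowledge distillation improves each individual expert's test performance on its assigned task, thereby improving the overall performance of the MoE.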

MoDE: A Mixture Model Based on Mutual Fusion among Experts

MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

Recent progress in self-supervised 3D human action representation learning
is largely attributed to contrastive learning. However, in conventional
contrastive frameworks, the rich complementarity between different skeleton
modalities remains under-explored. Moreover, when optimized to distinguish
self-augmented samples, models struggle with numerous similar positive
instances when action categories are limited. In this work, we tackle the
aforementioned problems by introducing a general Inter- and Intra-modal Mutual
Distillation (I$^2$MD) framework. In I$^2$MD, we first re-formulate the
cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process.
Different from existing distillation solutions that transfer the knowledge of a
pre-trained and fixed teacher to the student, in CMD, the knowledge is
continuously updated and bidirectionally distilled between modalities during
pre-training. To alleviate the interference of similar samples and exploit
their underlying contexts, we further design the Intra-modal Mutual
Distillation (IMD) strategy. In IMD, the Dynamic Neighbors Aggregation (DNA)
mechanism is first introduced, where an additional cluster-level discrimination
branch is instantiated in each modality. It adaptively aggregates
highly-correlated neighboring features, forming local cluster-level
contrasting. Mutual distillation is then performed between the two branches for
cross-level knowledge exchange. Extensive experiments on three datasets show
that our approach sets a series of new records.
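
A bidirectional distillation loss between two modalities can be sketched as
follows. This is an illustration under stated assumptions, not the authors'
implementation: each modality describes a sample by a softmax distribution
over its dot-product similarities to a shared list of anchor embeddings, and
the two distributions are aligned with a symmetric KL divergence. The function
names, the anchor set and the `temperature` parameter are hypothetical.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neighbor_distribution(query, anchors, temperature):
    """Distribution of one embedding's similarity to each anchor."""
    return softmax([dot(query, a) for a in anchors], temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); q is clamped to eps to avoid dividing by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def cross_modal_mutual_distillation_loss(
    query_a, anchors_a, query_b, anchors_b, temperature=0.1
):
    """Symmetric KL between the two modalities' neighbor distributions.

    Both directions are summed, so each modality acts as teacher and
    student at once (assumption: equal weight for both directions).
    """
    p_a = neighbor_distribution(query_a, anchors_a, temperature)
    p_b = neighbor_distribution(query_b, anchors_b, temperature)
    return kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a)


# Hypothetical 2-D embeddings of the same sample in two modalities.
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss = cross_modal_mutual_distillation_loss([1.0, 0.0], anchors, [0.0, 1.0], anchors)
print(loss)
```

The sketch only evaluates the loss value. In a training loop the
distributions would be recomputed at every step, which is what keeps the
exchanged knowledge continuously updated instead of fixed as with a
pre-trained teacher.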

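Introduces a new inter- and intra-modal mutual distillation framework that improves cross-modal interaction and addresses interference from similar samples in self-supervised learning, setting new records on three datasets.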

I$^2$MD: 3D Action Representation Learning with Intra- and Inter-modal Mutual Distillation

I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal Mutual Distillation

Federated learning (FL) enables distributed participants to collectively
learn a strong global model without sacrificing their individual data privacy.
Mainstream FL approaches require each participant to share a common network
architecture and further assume that data are sampled IID across
participants. However, in real-world deployments participants may require
heterogeneous network architectures, and the data distribution is almost
certainly non-uniform across participants. To address these issues, we introduce
FedH2L, which is both agnostic to the model architecture and robust to
different data distributions across participants. In contrast to approaches
sharing parameters or gradients, FedH2L relies on mutual distillation,
exchanging only posteriors on a shared seed set between participants in a
decentralized manner. This makes it extremely bandwidth-efficient and
model-agnostic and, crucially, produces models capable of performing well on the
whole data distribution when learning from heterogeneous silos.
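
Exchanging only posteriors on a shared seed set can be sketched as below. The
sketch rests on assumptions that are not from the paper: each participant is
any callable that maps an input to class logits, the distillation term is the
average KL divergence from each peer's posterior to the participant's own, and
the two toy models, the seed set and all function names are hypothetical. It
shows only the quantity that is exchanged, not a full training procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def seed_posteriors(model, seed_set):
    """Class posteriors of one participant on the shared seed set.

    This list is the only thing a participant sends to its peers;
    no parameters or gradients leave the participant.
    """
    return [softmax(model(x)) for x in seed_set]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); q is clamped to eps to avoid dividing by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(own, peers):
    """Average KL(peer || own) over all peers and seed examples."""
    total, count = 0.0, 0
    for peer in peers:
        for p_peer, p_own in zip(peer, own):
            total += kl_divergence(p_peer, p_own)
            count += 1
    return total / count if count else 0.0


# Two hypothetical participants with different "architectures".
def linear_model(x):
    return [x[0] - x[1], x[1] - x[0]]


def quadratic_model(x):
    return [x[0] ** 2, x[1] ** 2]


seed_set = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
post_a = seed_posteriors(linear_model, seed_set)
post_b = seed_posteriors(quadratic_model, seed_set)
loss_a = distillation_loss(post_a, [post_b])  # participant A learns from B
loss_b = distillation_loss(post_b, [post_a])  # participant B learns from A
print(loss_a, loss_b)
```

Because a posterior has one value per class for each seed example, the message
size depends on the seed set and the number of classes rather than on model
size, and the participants never need to agree on an architecture.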

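This paper proposes FedH2L, which uses mutual distillation and decentralized learning to train a strong, comprehensive global model in federated learning when participants have different network architectures and data distributions.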