Mixtures of Experts (MoEs) have gained prominence in (self-)supervised
learning due to their enhanced inference efficiency, adaptability to
distributed training, and modularity. Previous research has illustrated that
MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by
expanding the network's parameter count while reducing dormant neurons, thereby
enhancing the model's learning capacity and ability to deal with
non-stationarity. In this work, we shed more light on MoEs' ability to deal
with non-stationarity and investigate MoEs in DRL settings with "amplified"
non-stationarity via multi-task training, providing further evidence that MoEs
improve learning capacity. In contrast to previous work, our multi-task results
allow us to better understand the underlying causes of the beneficial effect
of MoEs in DRL training and the impact of the various MoE components, and they
offer insights into how best to incorporate MoEs in actor-critic-based DRL
networks. Finally, we also confirm results from previous work.
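
As a rough illustration of what an MoE layer computes, the following Python sketch routes an input through the top-k experts selected by a gate and combines their outputs with renormalized gate weights. It is a generic sparse-gating example, not the architecture studied in this work; the function names (`softmax`, `linear`, `moe_forward`), the ReLU experts, and the `top_k` parameter are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def moe_forward(x, gate_w, gate_b, experts, top_k=1):
    """Hypothetical top-k gated MoE layer.

    `experts` is a list of (weights, bias) pairs, one per expert.
    Returns the combined output and a dict {expert_index: gate_weight}.
    """
    gate_logits = linear(gate_w, gate_b, x)
    # Select the indices of the top_k experts by gate logit.
    top = sorted(range(len(experts)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    # Renormalize the gate over the selected experts only.
    probs = softmax([gate_logits[i] for i in top])
    out = [0.0] * len(experts[0][1])
    for p, i in zip(probs, top):
        w, b = experts[i]
        y = [max(0.0, v) for v in linear(w, b, x)]  # ReLU expert
        out = [o + p * v for o, v in zip(out, y)]
    return out, dict(zip(top, probs))


if __name__ == "__main__":
    x = [1.0, -2.0]
    gate_w, gate_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
    experts = [([[1.0, 0.0]], [0.0]), ([[0.0, 1.0]], [5.0])]
    print(moe_forward(x, gate_w, gate_b, experts, top_k=2))
```

Setting `top_k` equal to the number of experts gives a dense soft mixture, while smaller values activate only a subset of the parameters for each input.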

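Mixture-of-Experts models, which increase the parameter count and reduce dormant neurons, significantly improve performance in deep reinforcement learning. This work uses multi-task training to amplify non-stationarity, provides further evidence that MoEs improve learning capacity, and explores how best to incorporate MoE components in actor-critic networks.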

Mixture of Experts in a Mixture of RL settings

This work introduces Variational Diffusion Distillation (VDD), a novel method
that distills denoising diffusion policies into Mixtures of Experts (MoE)
through variational inference. Diffusion Models are the current
state-of-the-art in generative modeling due to their exceptional ability to
accurately learn and represent complex, multi-modal distributions. This ability
allows Diffusion Models to replicate the inherent diversity in human behavior,
making them the preferred models in behavior learning such as Learning from
Human Demonstrations (LfD). However, diffusion models come with some drawbacks,
including the intractability of likelihoods and long inference times due to
their iterative sampling process. The inference times, in particular, pose a
significant challenge to real-time applications such as robot control. In
contrast, MoEs address these issues while retaining the ability to represent
complex distributions, but they are notoriously difficult to train. VDD is the
first method that distills pre-trained diffusion models into
MoE models, and hence, combines the expressiveness of Diffusion Models with the
benefits of Mixture Models. Specifically, VDD leverages a decompositional upper
bound of the variational objective that allows the training of each expert
separately, resulting in a robust optimization scheme for MoEs. Across nine
complex behavior learning tasks, VDD demonstrates that it can: i) accurately
distill complex distributions learned by the diffusion model, ii) outperform
existing state-of-the-art distillation methods, and iii) surpass conventional
methods for training MoEs.
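
To make the contrast with diffusion policies concrete, the sketch below shows why a mixture model offers a tractable likelihood and single-pass sampling: the log-likelihood is a log-sum-exp over the experts, and a sample requires only one gate draw followed by one expert draw. It uses a one-dimensional Gaussian mixture with fixed gate probabilities; in a policy, the gate and the expert parameters would be conditioned on the state. This is not VDD's training procedure, and all function names are hypothetical.

```python
import math
import random


def gaussian_log_pdf(a, mean, std):
    """Log density of a 1-D Gaussian at `a`."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((a - mean) / std) ** 2)


def moe_log_likelihood(a, gate_probs, means, stds):
    """Exact log p(a) = log sum_k gate_k * N(a; mean_k, std_k)."""
    terms = [math.log(p) + gaussian_log_pdf(a, m, s)
             for p, m, s in zip(gate_probs, means, stds)]
    mx = max(terms)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(t - mx) for t in terms))


def moe_sample(gate_probs, means, stds, rng):
    """One-pass sampling: pick an expert, then sample from it."""
    k = rng.choices(range(len(gate_probs)), weights=gate_probs)[0]
    return rng.gauss(means[k], stds[k])


if __name__ == "__main__":
    rng = random.Random(0)
    gate, means, stds = [0.5, 0.5], [-5.0, 5.0], [0.1, 0.1]
    action = moe_sample(gate, means, stds, rng)
    print(action, moe_log_likelihood(action, gate, means, stds))
```

Both operations cost a fixed amount of work per call, in contrast to the iterative denoising loop that a diffusion policy runs for each action.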

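Variational Diffusion Distillation (VDD) is a method that distills pre-trained diffusion models into Mixture of Experts (MoE) models, combining the expressiveness of diffusion models with the benefits of mixture models. It trains each expert separately through a decompositional upper bound of the variational objective, accurately distills complex distributions on complex behavior learning tasks, and outperforms both existing distillation methods and conventional MoE training methods.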

Variational Distillation of Diffusion Policies into Mixture of Experts

Modern machine learning problems often involve datasets that are either
distributed by nature or so large that distributing the computations is the
standard way to proceed, since centralized algorithms are generally
ineffective. We propose a distributed learning approach for
mixtures of experts (MoE) models with an aggregation strategy to construct a
reduction estimator from local estimators fitted in parallel to distributed
subsets of the data. The aggregation is based on an optimal minimization of an
expected transportation divergence between the large MoE composed of local
estimators and the unknown desired MoE model. We show that the resulting
reduction estimator is consistent as soon as the local estimators to be
aggregated are consistent, and that it can be constructed by a proposed
majorization-minimization (MM) algorithm that is computationally efficient. We
study the statistical and numerical properties of the proposed reduction
estimator in experiments that demonstrate its performance compared, in
particular, to the global estimator constructed in a centralized way from the
full dataset. In some situations, computation is more than ten times faster at
comparable performance. Our source code is publicly available on GitHub.

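This work proposes a distributed learning approach for mixtures of experts in which local estimators are fitted in parallel to subsets of the data and together form a large MoE. The local estimators are aggregated by minimizing an expected transportation divergence, and a proposed majorization-minimization algorithm constructs the resulting reduction estimator in a computationally efficient way. Experiments examine the statistical and numerical properties of the reduction estimator and compare it with the global estimator built centrally from the full dataset: in some situations computation is more than ten times faster at comparable performance. The source code is publicly available on GitHub.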