Recent progress in Quality Diversity Reinforcement Learning (QD-RL) has
enabled learning a collection of behaviorally diverse, high-performing
policies. However, these methods typically involve storing thousands of
policies, which results in high space complexity and poor scaling to additional
behaviors. Condensing the archive into a single model while retaining the
performance and coverage of the original collection of policies has proved
challenging. In this work, we propose using diffusion models to distill the
archive into a single generative model over policy parameters. We show that our
method achieves a compression ratio of 13x while recovering 98% of the original
rewards and 89% of the original coverage. Further, the conditioning mechanism
of diffusion models allows for flexibly selecting and sequencing behaviors,
including using language.

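Summary: This work proposes using diffusion models to compress the thousands of policies produced by Quality Diversity Reinforcement Learning (QD-RL), distilling the archive into a single generative model. It achieves a 13x compression ratio while recovering 98% of the original rewards and 89% of the original coverage.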
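
To make the terms "archive", "coverage", and "compression ratio" concrete, here is a minimal sketch of a grid-style QD archive. It is not the paper's implementation: the class `GridArchive`, the helper `compression_ratio`, the assumption that behavior descriptors lie in [0, 1], and all numbers in the usage lines are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

Cell = Tuple[int, ...]


class GridArchive:
    """Toy QD archive (hypothetical): keeps the best-scoring policy per behavior cell."""

    def __init__(self, bins_per_dim: int, num_dims: int) -> None:
        self.bins_per_dim = bins_per_dim
        self.num_dims = num_dims
        # Maps a discretized behavior cell to (reward, policy_id).
        self.cells: Dict[Cell, Tuple[float, int]] = {}

    def _cell(self, behavior: Tuple[float, ...]) -> Cell:
        # Assumption: each behavior coordinate lies in [0, 1].
        return tuple(
            min(int(b * self.bins_per_dim), self.bins_per_dim - 1) for b in behavior
        )

    def add(self, policy_id: int, reward: float, behavior: Tuple[float, ...]) -> bool:
        """Insert the policy if its cell is empty or it beats the current occupant."""
        cell = self._cell(behavior)
        best = self.cells.get(cell)
        if best is None or reward > best[0]:
            self.cells[cell] = (reward, policy_id)
            return True
        return False

    def coverage(self) -> float:
        """Fraction of behavior cells that hold a policy."""
        return len(self.cells) / (self.bins_per_dim ** self.num_dims)


def compression_ratio(num_policies: int, params_per_policy: int, model_params: int) -> float:
    """Total parameters stored in the archive divided by the single model's parameters."""
    return (num_policies * params_per_policy) / model_params


# Usage with made-up values.
archive = GridArchive(bins_per_dim=2, num_dims=2)
archive.add(policy_id=0, reward=1.0, behavior=(0.1, 0.1))
archive.add(policy_id=1, reward=0.5, behavior=(0.2, 0.3))  # same cell, lower reward: rejected
archive.add(policy_id=2, reward=2.0, behavior=(0.9, 0.1))
print(archive.coverage())
print(compression_ratio(num_policies=10, params_per_policy=100, model_params=100))
```

Under this reading, a distilled generative model would be judged by how much of the archive's coverage and per-cell reward its sampled policies reproduce, relative to how many fewer parameters it stores.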

Generating Behaviorally Diverse Policies with Latent Diffusion Models

Training generally capable agents that perform well in unseen dynamic
environments is a long-term goal of robot learning. Quality Diversity
Reinforcement Learning (QD-RL) is an emerging class of reinforcement learning
(RL) algorithms that blend insights from Quality Diversity (QD) and RL to
produce a collection of high-performing and behaviorally diverse policies with
respect to a behavioral embedding. Existing QD-RL approaches have thus far
taken advantage of sample-efficient off-policy RL algorithms. However, recent
advances in high-throughput, massively parallelized robotic simulators have
opened the door for algorithms that can take advantage of such parallelism, and
it is unclear how to scale existing off-policy QD-RL methods to these new
data-rich regimes. In this work, we take the first steps to combine on-policy
RL methods, specifically Proximal Policy Optimization (PPO), that can leverage
massive parallelism, with QD, and propose a new QD-RL method with these
high-throughput simulators and on-policy training in mind. Our proposed
Proximal Policy Gradient Arborescence (PPGA) algorithm yields a 4x improvement
over baselines on the challenging humanoid domain.

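Summary: This paper introduces a QD-RL algorithm that combines high-throughput simulators with on-policy learning to train robots that perform well in unseen dynamic environments. The proposed PPGA algorithm achieves a 4x improvement over baselines on the humanoid domain.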