Recent algorithms designed for reinforcement learning tasks focus on finding
a single optimal solution. However, in many practical applications, it is
important to develop reasonable agents with diverse strategies. In this paper,
we propose Diversity-Guided Policy Optimization (DGPO), an on-policy framework
for discovering multiple strategies for the same task. Our algorithm uses
diversity objectives to guide a latent-code-conditioned policy to learn a set
of diverse strategies in a single training procedure. Specifically, we
formalize our algorithm as the combination of a diversity-constrained
optimization problem and an extrinsic-reward constrained optimization problem.
We cast the constrained optimization as a probabilistic inference task and
use policy iteration to maximize the derived lower bound. Experimental results
show that our method efficiently finds diverse strategies in a wide variety of
reinforcement learning tasks. We further show that DGPO achieves a higher
diversity score than the baselines, with similar sample complexity and
performance.

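This paper proposes Diversity-Guided Policy Optimization (DGPO), an algorithm that uses diversity objectives to guide a latent-code-conditioned policy, so that multiple distinct strategies are learned in a single training procedure. It treats the extrinsic-reward-constrained optimization problem as a probabilistic inference task and uses policy iteration to maximize the resulting lower bound. Experimental results show that the method effectively finds diverse strategies across a variety of reinforcement learning tasks, with a higher diversity score than the baseline models and similar sample complexity and performance.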
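
To make the idea of a latent-code-conditioned policy guided by a diversity objective more concrete, the following is a minimal sketch, not the DGPO implementation. All names (`LatentConditionedPolicy`, `diversity_reward`, `one_hot`) are hypothetical, the policy is a toy linear-softmax model, and the diversity signal shown (the log-posterior of the latent code minus a uniform log-prior) is an assumption borrowed from common practice in skill-discovery methods; the abstract does not specify the exact form DGPO uses.

```python
import numpy as np


def one_hot(z, n_codes):
    """Encode the integer latent code z as a one-hot vector."""
    v = np.zeros(n_codes)
    v[z] = 1.0
    return v


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


class LatentConditionedPolicy:
    """Toy linear-softmax policy pi(a | s, z) (hypothetical, for illustration).

    The latent code z is appended to the state as a one-hot vector, so a
    single set of weights represents one strategy per code.
    """

    def __init__(self, state_dim, n_codes, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.n_codes = n_codes
        self.W = rng.normal(scale=0.1, size=(n_actions, state_dim + n_codes))

    def action_probs(self, state, z):
        x = np.concatenate([state, one_hot(z, self.n_codes)])
        return softmax(self.W @ x)


def diversity_reward(code_posterior, z):
    """Assumed diversity signal: log q(z | s) - log p(z), with uniform p(z).

    code_posterior is a probability vector over latent codes, e.g. the
    output of a learned classifier that guesses z from the visited state.
    The reward is positive when the state makes the code easy to identify.
    """
    n_codes = len(code_posterior)
    return np.log(code_posterior[z] + 1e-8) - np.log(1.0 / n_codes)
```

In this sketch, sampling a different `z` at the start of each episode and adding `diversity_reward` to the learning signal would push the strategies for different codes toward distinguishable states; how that signal is balanced against the extrinsic reward is exactly what the constrained formulation in the abstract addresses, and is not modeled here.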