Exploration is a key problem in reinforcement learning, since agents can only
learn from data they acquire in the environment. With that in mind, maintaining
a population of agents is an attractive method, as it allows data be collected
with a diverse set of behaviors. This behavioral diversity is often boosted via
multi-objective loss functions. However, those approaches typically leverage
mean field updates based on pairwise distances, which makes them susceptible to
cycling behaviors and increased redundancy. In addition, explicitly boosting
diversity often has a detrimental impact on optimizing already fruitful
behaviors for rewards. As such, the reward-diversity trade off typically relies
on heuristics. Finally, such methods require behavioral representations, often
handcrafted and domain specific. In this paper, we introduce an approach to
optimize all members of a population simultaneously. Rather than using pairwise
distance, we measure the volume of the entire population in a behavioral
manifold, defined by task-agnostic behavioral embeddings. In addition, our
algorithm Diversity via Determinants (DvD), adapts the degree of diversity
during training using online learning techniques. We introduce both
evolutionary and gradient-based instantiations of DvD and show they effectively
improve exploration without reducing performance when better exploration is not
required.

本文介绍了一种基于行为多样性的优化方法，该方法使用任务不可知的行为嵌入度量整个人群的行为流形的体积，并通过在线学习技术适应多样性程度，从而提高探索能力，而不会降低性能。