Reinforcement learning (RL) often faces the challenge of uninformed search, in
which the agent must explore without access to domain knowledge such as
characteristics of the environment or external rewards. To tackle this
challenge, this work proposes a new approach for curriculum RL called
Diversify for Disagreement & Conquer (D2C). Unlike previous curriculum learning
methods, D2C requires only a few examples of desired outcomes and works in any
environment, regardless of its geometry or the distribution of the desired
outcome examples. The proposed method diversifies the goal-conditional
classifiers to identify similarities between visited and desired outcome
states, and ensures that the classifiers disagree on out-of-distribution
states. This disagreement makes it possible to quantify the unexplored region
and to design an arbitrary goal-conditioned intrinsic reward signal in a simple
and intuitive way. The proposed method then employs bipartite matching to define a
curriculum learning objective that produces a sequence of well-adjusted
intermediate goals, which enable the agent to automatically explore and conquer
the unexplored region. We present experimental results demonstrating that D2C
outperforms prior curriculum RL methods both quantitatively and qualitatively,
even with arbitrarily distributed desired outcome examples.
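
The idea of using classifier disagreement to flag unexplored states can be
illustrated with a small sketch. This is not D2C's actual objective: the
function name `disagreement`, the pairwise-difference measure, and the
prediction values are all assumptions chosen for illustration. It only shows
how an ensemble that agrees on visited states and disagrees elsewhere yields a
usable novelty score.

```python
from itertools import combinations
from statistics import mean


def disagreement(probs):
    """Mean absolute pairwise difference between ensemble predictions.

    `probs` holds one probability per classifier for the same state.
    This measure is an illustrative stand-in, not the paper's formulation.
    """
    return mean(abs(p - q) for p, q in combinations(probs, 2))


# Hypothetical outputs of three classifiers for "state resembles a desired outcome".
visited_state = [0.91, 0.88, 0.93]     # in-distribution: classifiers agree
unexplored_state = [0.10, 0.85, 0.45]  # out-of-distribution: classifiers disagree

# A higher score marks the state as less explored.
assert disagreement(unexplored_state) > disagreement(visited_state)
```

A score of this kind could serve as an exploration signal, since it grows
exactly where the classifiers were not constrained by visited data.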

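This work proposes D2C, a new curriculum RL method. It diversifies
goal-conditional classifiers so that they disagree on out-of-distribution
states, which quantifies the unexplored region and defines an arbitrary
goal-conditioned intrinsic reward signal. The method then produces a sequence
of well-adjusted intermediate goals that let the agent automatically explore
and conquer the unexplored region. Experimental results show that D2C
outperforms prior curriculum RL methods both quantitatively and qualitatively.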

Diversify & Conquer: Outcome-directed Curriculum RL via Out-of-Distribution Disagreement

Current reinforcement learning (RL) often struggles with challenging
exploration problems in which the desired outcomes or high rewards are rarely
observed. Even though curriculum RL, a framework that solves complex tasks by
proposing a sequence of surrogate tasks, shows reasonable results, most
previous works still have difficulty proposing a curriculum because they lack
a mechanism for obtaining calibrated guidance to the desired outcome state
without any prior domain knowledge. To alleviate this, we propose an
uncertainty- and temporal-distance-aware curriculum goal generation method for
outcome-directed RL that works by solving a bipartite matching problem. It not
only provides precisely calibrated guidance of the curriculum to the desired
outcome states but also brings much better sample efficiency and
geometry-agnostic curriculum goal proposal capability compared to previous
curriculum RL methods. We demonstrate, both quantitatively and qualitatively,
that our algorithm significantly outperforms these prior methods in a variety
of challenging navigation and robotic manipulation tasks.
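
To make the bipartite matching step concrete, here is a minimal sketch of
assigning candidate curriculum goals to desired-outcome examples at minimum
total cost. The function `match_goals` and the cost values are hypothetical;
how the costs would be derived from uncertainty and temporal-distance estimates
is not specified here and is left as an assumption. The brute-force search is
only meant for tiny instances.

```python
from itertools import permutations


def match_goals(cost):
    """Return (total_cost, assignment) minimising sum(cost[i][assignment[i]]).

    cost[i][j] is a hypothetical cost of pairing candidate state j with
    desired-outcome example i. Each candidate is used at most once.
    """
    n_targets, n_candidates = len(cost), len(cost[0])
    best = None
    # Enumerate every way to pick a distinct candidate for each target.
    for perm in permutations(range(n_candidates), n_targets):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


# Made-up costs: 2 desired-outcome examples (rows), 4 candidate states (columns).
cost = [
    [4.0, 1.0, 3.0, 9.0],
    [2.0, 0.5, 5.0, 9.0],
]
total, assignment = match_goals(cost)
# assignment[i] is the candidate index chosen as the goal for example i.
```

For realistic problem sizes, a polynomial-time solver such as
`scipy.optimize.linear_sum_assignment` would be the usual choice in place of
the enumeration above.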

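This paper proposes an uncertainty- and temporal-distance-aware curriculum goal
generation method for reinforcement learning. By solving a bipartite matching
problem, it provides precisely calibrated guidance for the curriculum,
addresses the shortcomings of prior curriculum RL methods, and significantly
outperforms them both quantitatively and qualitatively.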