Current reinforcement learning (RL) algorithms can be brittle and difficult
to use, especially when learning goal-reaching behaviors from sparse rewards.
Although supervised imitation learning provides a simple and stable
alternative, it requires access to demonstrations from a human supervisor. In
this paper, we study RL algorithms that use imitation learning to acquire
goal-reaching policies from scratch, without the need for expert demonstrations or a
value function. In lieu of demonstrations, we leverage the property that any
trajectory is a successful demonstration for reaching the final state in that
same trajectory. We propose a simple algorithm in which an agent continually
relabels and imitates the trajectories it generates to progressively learn
goal-reaching behaviors from scratch. In each iteration, the agent collects new
trajectories using the latest policy, and maximizes the likelihood of the
actions along these trajectories under the goal that was actually reached, so
as to improve the policy. We formally show that this iterated supervised
learning procedure optimizes a bound on the RL objective, derive performance
bounds of the learned policy, and empirically demonstrate improved
goal-reaching performance and robustness over current RL algorithms in several
benchmark tasks.
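
The relabel-and-imitate loop described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 1-D chain environment, the constants `N_STATES`, `N_ACTIONS`, and `HORIZON`, and the count-table policy, which stands in for whatever function approximator the authors use. For a tabular categorical policy, incrementing action counts and normalizing is a smoothed maximum-likelihood fit, so it plays the role of the supervised learning step.

```python
import numpy as np

# Hypothetical 1-D chain environment; actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 4

def step(state, action):
    """Deterministic chain dynamics, clipped at both ends."""
    return int(np.clip(state + (1 if action == 1 else -1), 0, N_STATES - 1))

def policy_probs(counts, state, goal):
    """Goal-conditioned tabular policy: normalized action counts."""
    c = counts[state, goal]
    return c / c.sum()

def rollout(counts, goal, rng, start=0):
    """Collect one trajectory with the latest policy, conditioned on `goal`."""
    states, actions = [start], []
    for _ in range(HORIZON):
        a = int(rng.choice(N_ACTIONS, p=policy_probs(counts, states[-1], goal)))
        actions.append(a)
        states.append(step(states[-1], a))
    return states, actions

def relabel(states, actions):
    """Hindsight relabeling: the state actually reached becomes the goal."""
    reached = states[-1]
    return [(s, a, reached) for s, a in zip(states[:-1], actions)]

def gcsl(iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    # Pseudo-counts of 1 give a uniform initial policy.
    counts = np.ones((N_STATES, N_STATES, N_ACTIONS))
    for _ in range(iterations):
        goal = int(rng.integers(N_STATES))           # commanded goal
        states, actions = rollout(counts, goal, rng)
        for s, a, g in relabel(states, actions):
            counts[s, g, a] += 1                     # smoothed max-likelihood update
    return counts
```

Calling `gcsl()` returns the count table, and `policy_probs(counts, state, goal)` reads off the learned action distribution. Note that the commanded goal is used only to collect data; the supervised update always uses the goal that was actually reached.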

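This paper presents an RL algorithm that uses imitation learning to acquire goal-reaching policies from scratch, without expert demonstrations or a value function, and that achieves better goal-reaching performance and robustness than existing RL algorithms on several benchmark tasks.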

Learning to Reach Goals via Iterated Supervised Learning

Autonomous agents that must exhibit flexible and broad capabilities will need
to be equipped with large repertoires of skills. Defining each skill with a
manually-designed reward function limits this repertoire and imposes a manual
engineering burden. Self-supervised agents that set their own goals can
automate this process, but designing appropriate goal-setting objectives can be
difficult, and often involves heuristic design decisions. In this paper, we
propose a formal exploration objective for goal-reaching policies that
maximizes state coverage. We show that this objective is equivalent to
maximizing goal-reaching performance together with the entropy of the goal
distribution, where goals correspond to full state observations. To instantiate
this principle, we present an algorithm called Skew-Fit for learning
maximum-entropy goal distributions. We prove that, under regularity conditions,
Skew-Fit converges to a uniform distribution over the set of valid states, even
when we do not know this set beforehand. Our experiments show that combining
Skew-Fit for learning goal distributions with existing goal-reaching methods
outperforms a variety of prior methods on open-sourced visual goal-reaching
tasks. Moreover, we demonstrate that Skew-Fit enables a real-world robot to
learn to open a door, entirely from scratch, from pixels, and without any
manually-designed reward function.
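
To make the idea of pushing a goal distribution toward uniform concrete, here is a minimal sketch on a small discrete state space. It assumes that the skewing step reweights each visited state by a negative power `alpha` of its estimated density, a detail the abstract above does not state. The function names and the toy data are hypothetical, and an empirical histogram stands in for a learned generative model.

```python
import numpy as np

def empirical_dist(states, n_states):
    """Estimate the density of visited states from raw counts."""
    counts = np.bincount(states, minlength=n_states).astype(float)
    return counts / counts.sum()

def skewed_dist(states, n_states, alpha=-1.0):
    """Reweight each sample by q(s)**alpha (alpha <= 0) and renormalize.

    Rarely visited states receive larger weights, so the resulting
    distribution is flatter than the empirical one.
    """
    q = empirical_dist(states, n_states)
    weights = q[states] ** alpha      # q[states] > 0 for every visited state
    weights /= weights.sum()
    return np.bincount(states, weights=weights, minlength=n_states)

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability states."""
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

# Toy buffer: state 0 is heavily over-represented, state 3 is never visited.
visited = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
before = empirical_dist(visited, n_states=4)
after = skewed_dist(visited, n_states=4, alpha=-1.0)
```

In this toy case, `alpha=-1.0` makes `after` uniform over the three visited states, while `alpha=0.0` leaves the empirical distribution unchanged. The unvisited state keeps zero mass in a single skewing step, so reaching it depends on further data collection.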

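This paper proposes a formal exploration objective for goal-reaching policies that maximizes state coverage. Skew-Fit, an algorithm for learning maximum-entropy goal distributions, is combined with existing goal-reaching methods to outperform prior methods on open-sourced visual goal-reaching tasks, and it enables a real-world robot to learn to open a door from pixels, without a manually-designed reward function.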