Supervised regression to demonstrations has been shown to be a stable
way to train deep policy networks. We are motivated to study how we can take
full advantage of supervised loss functions for stably training deep
reinforcement learning agents. This is a challenging task because it is unclear
how the training data could be collected to enable policy improvement. In this
work, we propose Self-Supervised Reinforcement Learning (SSRL), a simple
algorithm that optimizes policies with purely supervised losses. We demonstrate
that, without policy gradient or value estimation, an iterative procedure of
"labeling" data and supervised regression is sufficient to drive stable policy
improvement. By selecting and imitating trajectories with high episodic
rewards, SSRL is surprisingly competitive with contemporary algorithms, with
more stable performance and less running time, showing the potential of solving
reinforcement learning with supervised learning techniques. The code is
available at this https URL
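
As a rough illustration of the "select and imitate" loop described above, the following sketch keeps the trajectories with the highest episodic return and fits a policy to their state-action pairs by least squares. It is not the authors' implementation: the trajectory layout, the `top_fraction` parameter, the linear policy, and all function names are assumptions made for this example, and a deep policy network would replace the least-squares fit in practice.

```python
import numpy as np


def select_top_trajectories(trajectories, top_fraction=0.2):
    """Keep the trajectories with the highest episodic return.

    Assumed (hypothetical) layout: each trajectory is a dict with
    'states' of shape (T, state_dim), 'actions' of shape (T, action_dim),
    and a scalar 'episode_return'.
    """
    n_keep = max(1, int(len(trajectories) * top_fraction))
    ranked = sorted(trajectories, key=lambda t: t["episode_return"], reverse=True)
    return ranked[:n_keep]


def fit_linear_policy(selected):
    """Supervised regression step: least-squares fit of actions on states."""
    states = np.concatenate([t["states"] for t in selected], axis=0)
    actions = np.concatenate([t["actions"] for t in selected], axis=0)
    # Append a constant column so the linear policy can learn a bias term.
    features = np.hstack([states, np.ones((states.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights


def act(weights, state):
    """Deterministic action of the fitted linear policy for one state."""
    return np.append(state, 1.0) @ weights
```

In an iterative procedure, the fitted policy would be used to collect new trajectories, which are then ranked and regressed on again; the exploration and data-collection details are left out of this sketch.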

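By learning policy networks through self-supervised regression, this work proposes SSRL, an algorithm that trains deep reinforcement learning agents with supervised loss functions. The algorithm requires no policy gradient or value estimation; it steadily improves policy performance through supervised regression on data, is comparable to existing algorithms in efficiency and performance, and shows the potential of solving reinforcement learning problems with supervised learning techniques.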

Simplifying Deep Reinforcement Learning via Self-Supervision

The success of popular algorithms for deep reinforcement learning, such as
policy-gradients and Q-learning, relies heavily on the availability of an
informative reward signal at each timestep of the sequential decision-making
process. When rewards are only sparsely available during an episode, or
feedback is provided only after episode termination, these algorithms
perform sub-optimally due to the difficulty of credit assignment.
Alternatively, trajectory-based policy optimization methods, such as the
cross-entropy method and evolution strategies, do not require per-timestep
rewards, but have been found to suffer from high sample complexity by
completely forgoing the temporal nature of the problem. Improving the
efficiency of RL algorithms in real-world problems with sparse or episodic
rewards is therefore a pressing need. In this work, we introduce a
self-imitation learning algorithm that exploits and explores well in the sparse
and episodic reward settings. We view each policy as a state-action visitation
distribution and formulate policy optimization as a divergence minimization
problem. We show that with Jensen-Shannon divergence, this divergence
minimization problem can be reduced into a policy-gradient algorithm with
shaped rewards learned from experience replays. Experimental results indicate
that our algorithm performs comparably to existing algorithms in environments with
dense rewards, and significantly better in environments with sparse and
episodic rewards. We then discuss limitations of self-imitation learning, and
propose to solve them by using Stein variational policy gradient descent with
the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate
its effectiveness on a challenging variant of continuous-control MuJoCo
locomotion tasks.
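
To make the divergence-minimization view concrete, the sketch below estimates state-action visitation distributions from sampled pairs in a small discrete setting and computes the Jensen-Shannon divergence between them. It is only an illustration under stated assumptions: the tabular, undiscounted visitation estimate and all function names are hypothetical, and the learned shaped rewards and policy-gradient update from the abstract are not reproduced here.

```python
import numpy as np


def visitation_distribution(state_action_pairs, n_states, n_actions):
    """Empirical (undiscounted) state-action visitation distribution.

    Assumption for this sketch: states and actions are small integer indices.
    """
    counts = np.zeros((n_states, n_actions))
    for s, a in state_action_pairs:
        counts[s, a] += 1.0
    return counts / counts.sum()


def kl_divergence(p, q):
    """KL(p || q) in nats, summing only over entries where p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Here one distribution would come from the current policy's rollouts and the other from high-return trajectories stored in a replay buffer; a smaller divergence means the policy visits state-action pairs more like those of its own best past behavior.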

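This paper proposes a deep reinforcement learning algorithm based on self-imitation learning, aimed at improving the efficiency of RL algorithms in sparse and episodic reward settings. It uses Stein variational policy gradient descent to address the limitations of self-imitation learning and demonstrates its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.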