Ideally, we would place a robot in a real-world environment and leave it
there improving on its own by gathering more experience autonomously. However,
algorithms for autonomous robotic learning have been challenging to realize in
the real world. While this has often been attributed to sample complexity, even
sample-efficient techniques are hampered by two major challenges: the
difficulty of providing well-shaped rewards and the difficulty of continual,
reset-free training. In this work, we describe a system
for real-world reinforcement learning that enables agents to show continual
improvement by training directly in the real world without requiring
painstaking effort to hand-design reward functions or reset mechanisms. Our
system uses occasional human-in-the-loop feedback from remote, non-expert users
to learn informative distance functions that guide exploration, and it uses a
simple self-supervised learning algorithm for goal-directed policy learning. We
show that, in the absence of resets, it is particularly important
to account for the current "reachability" of the exploration policy when
deciding which regions of the space to explore. Based on this insight, we
instantiate a practical learning system, GEAR, which enables robots to simply
be placed in real-world environments and left to train autonomously without
interruption. The system streams robot experience to a web interface and
requires only occasional, asynchronous binary comparative feedback from remote,
crowdsourced, non-expert humans. We evaluate this system on a suite of robotic
tasks in simulation and demonstrate its effectiveness at learning behaviors
both in simulation and in the real world.
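
As a rough illustration of how binary comparative feedback can train a distance function, the sketch below fits a model with a Bradley-Terry (logistic) preference loss and then picks an exploration subgoal that is both rated close to the goal and already reachable. This is a minimal sketch under stated assumptions, not GEAR's implementation: the linear model, the synthetic annotator, the use of visit counts as a proxy for reachability, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predicted_distance(weights, state):
    """Linear stand-in for a learned distance-to-goal model (an assumption)."""
    return sum(w * x for w, x in zip(weights, state))


def fit_distance(comparisons, dim, lr=0.1, epochs=200):
    """Fit weights from (closer_state, farther_state) pairs.

    Each pair records one binary answer: the annotator judged the first
    state to be closer to the goal than the second.
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for closer, farther in comparisons:
            margin = (predicted_distance(weights, farther)
                      - predicted_distance(weights, closer))
            p_agree = sigmoid(margin)  # model's probability of the human's answer
            for i in range(dim):
                # Gradient of -log(p_agree) with respect to weights[i].
                grad = -(1.0 - p_agree) * (farther[i] - closer[i])
                weights[i] -= lr * grad
    return weights


def select_subgoal(weights, visited_states, visit_counts, min_visits=2):
    """Pick the visited state rated closest to the goal, restricted to states
    reached at least min_visits times (a hypothetical reachability proxy)."""
    reachable = [s for s, n in zip(visited_states, visit_counts) if n >= min_visits]
    candidates = reachable or visited_states  # fall back if nothing qualifies
    return min(candidates, key=lambda s: predicted_distance(weights, s))


def simulated_comparisons(num_pairs, seed=0):
    """Synthetic annotator: true distance to the goal is x + y (hypothetical)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        a = [rng.random(), rng.random()]
        b = [rng.random(), rng.random()]
        pairs.append((a, b) if sum(a) < sum(b) else (b, a))
    return pairs


weights = fit_distance(simulated_comparisons(50), dim=2)
visited = [[0.9, 0.9], [0.5, 0.5], [0.1, 0.1]]
counts = [5, 3, 1]  # the state nearest the goal was reached only once
print(select_subgoal(weights, visited, counts))  # prints [0.5, 0.5]
```

The state nearest the goal is skipped because it was reached only once, so the sketch commands the closest state that the current policy reaches reliably.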

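Algorithms for autonomous learning have long been a challenge for robots in real-world environments. This work describes a practical reinforcement learning system that achieves continual improvement by training in the real world with the help of human feedback. Without requiring hand-designed reward functions or reset mechanisms, the system uses a self-supervised learning algorithm and information derived from human feedback to guide exploration and policy learning. Evaluation on robotic tasks in simulation and in the real world shows that the system learns behaviors effectively.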

Autonomous Robotic Reinforcement Learning with Asynchronous Human Feedback

An oft-ignored challenge of real-world reinforcement learning is that the
real world does not pause when agents make learning updates. As standard
simulated environments do not address this real-time aspect of learning, most
available implementations of RL algorithms process environment interactions and
learning updates sequentially. As a consequence, when such implementations are
deployed in the real world, they may make decisions based on significantly
delayed observations and not act responsively. Asynchronous learning has been
proposed to solve this issue, but no systematic comparison between sequential
and asynchronous reinforcement learning has been conducted using real-world
environments. In this work, we set up two vision-based tasks with a robotic
arm, implement an asynchronous learning system that extends a previous
architecture, and compare sequential and asynchronous reinforcement learning
across different action cycle times, sensory data dimensions, and mini-batch
sizes. Our experiments show that when the time cost of learning updates
increases, the action cycle time in the sequential implementation can grow
excessively long, while the asynchronous implementation can always maintain an
appropriate action cycle time. Consequently, when learning updates are
expensive, the performance of sequential learning diminishes, and asynchronous
learning outperforms it by a substantial margin. Our system learns in real time
to reach and track visual targets from pixels within two hours of experience
and does so directly using real robots, learning completely from scratch.
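
The effect of update cost on action cycle time can be sketched with a toy timing experiment. The example below is an illustrative sketch, not the paper's architecture: `time.sleep` stands in for both acting and learning, and the function names and timing values are hypothetical. In the sequential loop each action cycle pays for an update, while in the asynchronous version a background learner thread absorbs that cost.

```python
import threading
import time


def run_sequential(num_steps, action_time, update_time):
    """Act and update in one loop; returns the mean action cycle time."""
    start = time.perf_counter()
    for _ in range(num_steps):
        time.sleep(action_time)  # stand-in for sensing and acting
        time.sleep(update_time)  # stand-in for one learning update
    return (time.perf_counter() - start) / num_steps


def run_asynchronous(num_steps, action_time, update_time):
    """Act in this thread while a learner thread updates in the background.

    Returns (mean action cycle time, number of completed updates).
    """
    buffer = []  # transitions shared between actor and learner
    lock = threading.Lock()
    stop = threading.Event()
    completed_updates = [0]

    def learner():
        while not stop.is_set():
            with lock:
                has_data = len(buffer) > 0
            if has_data:
                time.sleep(update_time)  # stand-in for one learning update
                completed_updates[0] += 1
            else:
                time.sleep(0.001)  # wait for the first transition

    thread = threading.Thread(target=learner)
    thread.start()
    start = time.perf_counter()
    for step in range(num_steps):
        time.sleep(action_time)  # stand-in for sensing and acting
        with lock:
            buffer.append(step)  # hand the new transition to the learner
    mean_cycle = (time.perf_counter() - start) / num_steps
    stop.set()
    thread.join()
    return mean_cycle, completed_updates[0]


sequential_cycle = run_sequential(10, action_time=0.005, update_time=0.03)
asynchronous_cycle, updates = run_asynchronous(10, action_time=0.005, update_time=0.03)
print(f"sequential: {sequential_cycle * 1000:.1f} ms per action")
print(f"asynchronous: {asynchronous_cycle * 1000:.1f} ms per action, {updates} updates")
```

Sleeping releases CPython's global interpreter lock, so the two threads overlap cleanly here; a compute-heavy update written in pure Python would likely need a separate process to get the same effect.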

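This paper compares asynchronous and sequential learning, with experiments on vision-based tasks using a robotic arm in a real-world environment. The results show that as the time cost of learning updates increases, the performance of sequential learning drops significantly and asynchronous learning clearly outperforms it.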