Learning from examples of success is an appealing approach to reinforcement
learning that eliminates many of the disadvantages of using hand-crafted reward
functions or full expert-demonstration trajectories, both of which can be
difficult to acquire, biased, or suboptimal. However, learning from examples
alone dramatically increases the exploration challenge, especially for complex
tasks. This work introduces value-penalized auxiliary control from examples
(VPACE); we significantly improve exploration in example-based control by
adding scheduled auxiliary control and examples of auxiliary tasks.
Furthermore, we identify a value-calibration problem, where policy value
estimates can exceed their theoretical limits based on successful data. We
resolve this problem, which is exacerbated by learning auxiliary tasks, through
the addition of an above-success-level value penalty. Across three simulated
environments and one real robotic manipulation environment, covering 21 main tasks, we
show that our approach substantially improves learning efficiency. Videos,
code, and datasets are available at this https URL
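
To make the value-calibration idea concrete, the following is a minimal sketch of an above-success-level value penalty. It is an illustration under stated assumptions, not the authors' implementation: the function name `above_success_penalty`, the quadratic form of the penalty, the use of the mean value on success examples as the "success level", and the `weight` parameter are all hypothetical choices that the abstract does not specify.

```python
def above_success_penalty(policy_values, success_values, weight=1.0):
    """Penalize value estimates that exceed the success-level value.

    Assumptions (not from the source): the success level is the mean value
    estimate over success examples, and the penalty is quadratic in the excess.
    """
    # Hypothetical success level: mean value estimate on example success states.
    success_level = sum(success_values) / len(success_values)
    # Only the amount by which each policy value exceeds that level is penalized.
    excess = [max(v - success_level, 0.0) for v in policy_values]
    # Mean squared excess, scaled by an assumed penalty weight.
    return weight * sum(e * e for e in excess) / len(excess)


if __name__ == "__main__":
    # One of the two policy values exceeds the success level of 2.0 by 1.0.
    print(above_success_penalty([1.0, 3.0], [2.0, 2.0]))
```

In a training loop, a term like this would presumably be added to the critic loss so that estimates for policy data are discouraged from rising above those for successful data; how the term is weighted and combined is left open here.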

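By adding scheduled auxiliary control and examples of auxiliary tasks, this work significantly improves exploration in example-based control and resolves the problem of value estimates exceeding their theoretical limits, substantially improving learning efficiency.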

Value-Penalized Auxiliary Control from Examples for Learning without Rewards or Demonstrations

Task automation for surgical robots has the potential to improve surgical
efficiency. Recent reinforcement learning (RL) based approaches provide
scalable solutions to surgical automation, but typically require extensive data
collection to solve a task if no prior knowledge is given. This issue is known
as the exploration challenge, which can be alleviated by providing expert
demonstrations to an RL agent. Yet, how to make effective use of demonstration
data to improve exploration efficiency remains an open challenge. In this
work, we introduce Demonstration-guided EXploration (DEX), an efficient
reinforcement learning algorithm that aims to overcome the exploration problem
with expert demonstrations for surgical automation. To effectively exploit
demonstrations, our method estimates expert-like behaviors with higher values
to facilitate productive interactions, and adopts non-parametric regression to
enable such guidance at states unobserved in demonstration data. Extensive
experiments on $10$ surgical manipulation tasks from SurRoL, a comprehensive
surgical simulation platform, demonstrate significant improvements in the
exploration efficiency and task success rates of our method. We also
deploy the learned policies on the da Vinci Research Kit (dVRK) platform to
show their effectiveness on a real robot. Code is available at
this https URL.
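
As a rough illustration of how non-parametric regression can provide guidance at states absent from the demonstrations, the sketch below averages the expert actions of the k nearest demonstrated states. This is an assumption-laden example, not DEX itself: the function name `knn_expert_action`, the Euclidean distance, the unweighted average, and the default `k=3` are hypothetical choices that the abstract does not specify.

```python
import math


def knn_expert_action(state, demo_states, demo_actions, k=3):
    """Estimate an expert-like action at `state` by k-nearest-neighbor regression.

    Assumptions (not from the source): Euclidean distance between states and
    an unweighted mean over the k closest demonstrated actions.
    """
    # Distance from the query state to every demonstrated state.
    dists = [math.dist(state, s) for s in demo_states]
    # Indices of the k closest demonstrations (fewer if the dataset is small).
    nearest = sorted(range(len(demo_states)), key=lambda i: dists[i])[:k]
    # Average the corresponding expert actions dimension by dimension.
    dim = len(demo_actions[0])
    return [sum(demo_actions[i][d] for i in nearest) / len(nearest) for d in range(dim)]


if __name__ == "__main__":
    demo_states = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
    demo_actions = [[1.0], [3.0], [100.0]]
    # The query state was never demonstrated; its two nearest neighbors are used.
    print(knn_expert_action([0.4, 0.0], demo_states, demo_actions, k=2))
```

An estimate of this kind could serve as the reference for judging how expert-like an agent's action is at an unseen state; how such a reference enters the value estimate is not described in the abstract and is left open here.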

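This paper presents a reinforcement learning algorithm for surgical automation that uses expert demonstration data to improve exploration efficiency and overcome the exploration challenge. Experiments show significant improvements on $10$ surgical manipulation tasks, and the method's effectiveness is demonstrated on a real robot.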