Simulation-to-real learning, i.e., learning policies in simulation and
transferring them to the real world, is one of the most promising approaches
towards data-efficient learning in robotics. However, due
to the inevitable reality gap between simulation and the real world, a
policy learned in simulation may not always produce safe behaviour on
the real robot. As a result, during adaptation of the policy in the real world,
the robot may damage itself or cause harm to its surroundings. In this work, we
introduce a novel learning algorithm called SafeAPT that leverages a diverse
repertoire of policies evolved in the simulation and transfers the most
promising safe policy to the real robot through episodic interaction. To
achieve this, SafeAPT iteratively learns a probabilistic reward model as well
as a safety model using real-world observations combined with simulated
experiences as priors. Then, it performs Bayesian optimization on the
repertoire with the reward model while maintaining the specified safety
constraint using the safety model. SafeAPT allows a robot to adapt to a wide
range of goals safely with the same repertoire of policies evolved in the
simulation. We compare SafeAPT with several baselines in both simulated and
real robotic experiments, and show that SafeAPT finds high-performance policies
within a few minutes in the real world while minimizing safety violations
during the interactions.
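
The loop described above (probabilistic reward and safety models with simulated values as priors, followed by safety-constrained Bayesian optimization over the repertoire) can be illustrated with a minimal sketch. This is not the authors' implementation. The Gaussian-process models with a unit-variance RBF kernel, the UCB acquisition, the pessimistic safety bound, and all names (`gp_posterior`, `select_policy`, `adapt`, `real_trial`, `kappa`, `beta`) are assumptions chosen for illustration, and the convention that a larger safety value means a safer policy is also assumed.

```python
import numpy as np

def rbf(a, b, length=0.3):
    # Squared-exponential kernel between two sets of policy descriptors.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior(X_all, prior_mean, idx, y, noise=1e-3):
    # GP posterior over every repertoire entry; simulated values act as the prior mean.
    if len(idx) == 0:
        return prior_mean.copy(), np.ones(len(X_all))
    Xo = X_all[idx]
    K = rbf(Xo, Xo) + noise * np.eye(len(idx))
    Ks = rbf(X_all, Xo)
    resid = np.asarray(y) - prior_mean[idx]
    mu = prior_mean + Ks @ np.linalg.solve(K, resid)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def select_policy(mu_r, sd_r, mu_s, sd_s, safety_limit, kappa=1.0, beta=2.0):
    # A policy counts as safe if its pessimistic safety estimate meets the limit.
    pessimistic = mu_s - beta * sd_s
    safe = pessimistic >= safety_limit
    if not safe.any():
        # Assumed fallback: try the most conservative entry.
        return int(np.argmax(pessimistic))
    # Maximize the reward UCB over the safe set only.
    ucb = np.where(safe, mu_r + kappa * sd_r, -np.inf)
    return int(np.argmax(ucb))

def adapt(X, sim_reward, sim_safety, real_trial, safety_limit, episodes=10):
    # Episodic adaptation: refit both models, pick a policy, run one real trial.
    idx, rewards, safeties = [], [], []
    for _ in range(episodes):
        mu_r, sd_r = gp_posterior(X, sim_reward, idx, rewards)
        mu_s, sd_s = gp_posterior(X, sim_safety, idx, safeties)
        i = select_policy(mu_r, sd_r, mu_s, sd_s, safety_limit)
        r, s = real_trial(i)  # hypothetical callback returning (reward, safety)
        idx.append(i)
        rewards.append(r)
        safeties.append(s)
    best = idx[int(np.argmax(rewards))]
    return best, idx, safeties
```

In this sketch the repertoire is a fixed array `X` of policy descriptors, so the optimization is an argmax over a finite set rather than a continuous search. With no real data the unit prior variance makes every entry look uncertain, so the first trial falls back to the entry with the best simulated safety value, and the safe set grows as real observations shrink the uncertainty of nearby entries.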

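This work introduces a learning algorithm called SafeAPT, which takes policies learned in simulation and transfers them safely to a real robot through real-world interaction, without harming the robot or its surroundings. The algorithm iteratively learns a probabilistic reward model and a safety model, using simulated experience as priors, and selects a policy while satisfying the safety constraint. Comparative experiments on real and simulated robots show that SafeAPT finds high-performance policies in a short time while minimizing safety violations during the interactions.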