Simulation-to-Reality Reinforcement Learning (Sim-to-Real RL) seeks to use
simulations to minimize the need for extensive real-world interactions.
Specifically, in the few-shot off-dynamics setting, the goal is to train a
policy in a simulator whose dynamics differ from those of the real world, such
that it transfers effectively using only a handful of real-world transitions.
In this context, conventional RL agents tend to exploit simulation inaccuracies,
resulting in policies that excel in the simulator but underperform in the real
environment. To address this challenge, we introduce a novel approach, inspired
by recent advances in Imitation Learning and Trust-Region-based RL algorithms,
that incorporates a penalty to constrain the trajectories induced by the
simulator-trained policy. We evaluate our method across various
environments representing diverse Sim-to-Real conditions, where access to the
real environment is extremely limited. These experiments include
high-dimensional systems relevant to real-world applications. Across most
tested scenarios, our proposed method improves performance over existing
baselines.

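Simulation is used to minimize the need for real-world interaction. In the
few-shot off-dynamics setting, we introduce a new method that uses a penalty to
constrain the trajectories induced by the simulator-trained policy, addressing
the tendency of conventional reinforcement learning agents to exploit
simulation inaccuracies. We evaluate the method in a variety of environments,
including high-dimensional systems representing different Sim-to-Real
conditions, and in most tested scenarios it improves on existing baselines.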