Training robots for operation in the real world is a complex, time consuming
and potentially expensive task. Despite significant success of reinforcement
learning in games and simulations, research in real robot applications has not
been able to match similar progress. While sample complexity can be reduced by
training policies in simulation, such policies can perform sub-optimally on the
real platform given imperfect calibration of model dynamics. We present an
approach -- supplemental to fine tuning on the real robot -- to further benefit
from parallel access to a simulator during training and reduce sample
requirements on the real robot. The developed approach harnesses auxiliary
rewards to guide the exploration for the real world agent based on the
proficiency of the agent in simulation and vice versa. In this context, we
demonstrate empirically that the reciprocal alignment for both agents provides
further benefit as the agent in simulation can adjust to optimize its behaviour
for states commonly visited by the real-world agent.

通过强化学习在模拟环境中训练机器人并结合补充奖励策略，与真实机器人进行进一步的微调来优化探索策略，实验结果表明，这种相互对齐的方法可以在真实和模拟环境中实现更好的性能。