Training robots to operate in the real world is a complex, time-consuming,
and potentially expensive task. Despite the significant success of reinforcement
learning in games and simulations, research on real robot applications has not
achieved comparable progress. While sample complexity can be reduced by
training policies in simulation, such policies can perform sub-optimally on the
real platform because the model dynamics are imperfectly calibrated. We present an
approach, supplemental to fine-tuning on the real robot, that further benefits
from parallel access to a simulator during training and reduces sample
requirements on the real robot. The approach uses auxiliary
rewards to guide the exploration of the real-world agent based on the
proficiency of the agent in simulation, and vice versa. In this context, we
demonstrate empirically that the reciprocal alignment of both agents provides
further benefit, as the agent in simulation can adjust to optimize its behaviour
for states commonly visited by the real-world agent.
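The abstract does not say how the auxiliary rewards are computed. One plausible realization, assumed here and not taken from the text, is a discriminator that estimates p_sim, the probability that a visited state came from the simulated agent. Each agent then earns a bonus for visiting states the discriminator attributes to the other agent, which pulls the two state distributions toward each other. The names auxiliary_reward, shaped_reward, and the weight parameter are hypothetical.

```python
import math


def auxiliary_reward(p_sim: float, agent: str, eps: float = 1e-8) -> float:
    """Confusion-style auxiliary reward for one state (hypothetical sketch).

    p_sim: assumed discriminator output, the probability that the state
    was produced by the simulated agent.
    """
    if agent == "real":
        # Real-world agent: reward states that look like the simulated agent's.
        return math.log(p_sim + eps)
    if agent == "sim":
        # Simulated agent: reward states that look like the real-world agent's.
        return math.log(1.0 - p_sim + eps)
    raise ValueError(f"unknown agent: {agent!r}")


def shaped_reward(env_reward: float, p_sim: float, agent: str,
                  weight: float = 0.1) -> float:
    """Environment reward plus a weighted auxiliary term.

    weight is an assumed hyperparameter trading off task reward against
    alignment with the other agent's state distribution.
    """
    return env_reward + weight * auxiliary_reward(p_sim, agent)
```

For example, shaped_reward(1.0, 0.9, "real") exceeds shaped_reward(1.0, 0.1, "real"), because a real-world state that resembles simulated behaviour earns a larger bonus. The reciprocal term for the simulated agent is what lets it adapt toward states the real-world agent actually visits.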

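Robots are trained with reinforcement learning in simulation, and auxiliary rewards, used in addition to fine-tuning on the real robot, guide exploration; experimental results indicate that this mutual alignment between the simulated and real-world agents yields better performance.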

Mutual Alignment Transfer Learning

In 1977, Young proposed a voting scheme that extends the Condorcet Principle
based on the fewest possible number of voters whose removal yields a Condorcet
winner. We prove that both the winner problem and the ranking problem for Young
elections are complete for the class of problems solvable in polynomial time with
parallel access to NP. Analogous results for Lewis Carroll's 1876 voting scheme
were recently established by Hemaspaandra et al. In contrast, we prove that the
winner and ranking problems in Fishburn's homogeneous variant of Carroll's
voting scheme can be solved efficiently by linear programming.
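To make Young's criterion concrete, the following brute-force sketch computes, for each candidate, the fewest voters whose removal makes that candidate a Condorcet winner, and ranks candidates by that number. It assumes ballots are complete strict rankings and that a Condorcet winner must beat every other candidate by a strict pairwise majority. The function names are illustrative, and the search is exponential in the number of voters; it is not the paper's method, and the completeness result above suggests that no efficient exact algorithm should be expected.

```python
from itertools import combinations


def is_condorcet_winner(c, profile, candidates):
    """True if c beats every other candidate by a strict pairwise majority."""
    for d in candidates:
        if d == c:
            continue
        wins = sum(1 for ballot in profile if ballot.index(c) < ballot.index(d))
        if 2 * wins <= len(profile):
            return False
    return True


def young_removals(c, profile, candidates):
    """Fewest voters to delete so that c becomes a Condorcet winner.

    Returns None if no subset of voters makes c a Condorcet winner.
    """
    n = len(profile)
    for k in range(n + 1):  # try removing 0, 1, 2, ... voters
        for removed in combinations(range(n), k):
            kept = [b for i, b in enumerate(profile) if i not in removed]
            if is_condorcet_winner(c, kept, candidates):
                return k
    return None


def young_ranking(profile, candidates):
    """Rank candidates by Young removals: fewer is better, None ranks last."""
    scores = {c: young_removals(c, profile, candidates) for c in candidates}
    return sorted(
        candidates,
        key=lambda c: (scores[c] is None, scores[c] if scores[c] is not None else 0),
    )
```

On the classic Condorcet cycle, with ballots a>b>c, b>c>a, and c>a>b, no single removal produces a Condorcet winner, so every candidate needs two removals and all three tie.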

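This work studies the complexity of the winner and ranking problems for three voting schemes: for Young's scheme both problems are complete for the class of problems solvable in polynomial time with parallel access to NP (and are therefore NP-hard), analogous results for Lewis Carroll's scheme were established earlier by Hemaspaandra et al., and Fishburn's homogeneous variant of Carroll's scheme can be solved efficiently by linear programming.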