In this paper, we propose to combine imitation and reinforcement learning via the idea of reward shaping using an oracle. We study the effect of a near-optimal cost-to-go oracle on the planning horizon and demonstrate that the cost-to-go oracle shortens the learner's planning horizon as a function of its accuracy: a globally optimal oracle can shorten the planning horizon to one, leading to a one-step greedy Markov Decision Process that is much easier to optimize, while an oracle that is far from optimal requires planning over a longer horizon to achieve near-optimal performance. Hence our new insight bridges the gap and interpolates between imitation learning and reinforcement learning. Motivated by this insight, we propose Truncated HORizon Policy Search (THOR), a method that searches for policies that maximize the total reshaped reward over a finite planning horizon when the oracle is sub-optimal. We experimentally demonstrate that a gradient-based implementation of THOR can achieve superior performance compared to RL baselines and IL baselines even when the oracle is sub-optimal.
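
The relationship between oracle accuracy and planning horizon can be illustrated on a toy problem. The following is a minimal sketch, not the gradient-based implementation evaluated in the paper: it assumes a hypothetical six-state deterministic chain with a reward of 1 for reaching the goal, uses the standard potential-based shaping form r + γV(s') − V(s) with the oracle's value estimate V as the potential, and plans by exhaustive k-step dynamic programming. All names (`step`, `shaped_step`, `truncated_horizon_policy`, `V_STAR`, `ZERO`) are illustrative assumptions.

```python
GAMMA = 0.9
N_STATES = 6           # hypothetical chain with states 0..5
GOAL = N_STATES - 1    # absorbing goal state
ACTIONS = (-1, +1)     # index 0 = left, index 1 = right


def step(s, a):
    """Deterministic transition; reward 1 on first entering the goal."""
    if s == GOAL:
        return s, 0.0
    s_next = min(max(s + ACTIONS[a], 0), GOAL)
    return s_next, 1.0 if s_next == GOAL else 0.0


def shaped_step(s, a, oracle):
    """Reshape the reward using the oracle's value estimate as a potential."""
    s_next, r = step(s, a)
    return s_next, r + GAMMA * oracle[s_next] - oracle[s]


def truncated_horizon_policy(oracle, k):
    """First action of a k-step plan that maximizes total reshaped reward."""
    value = [0.0] * N_STATES   # nothing left to collect with 0 steps to go
    policy = [0] * N_STATES
    for _ in range(k):
        new_value = []
        for s in range(N_STATES):
            q = []
            for a in range(len(ACTIONS)):
                s_next, r = shaped_step(s, a, oracle)
                q.append(r + GAMMA * value[s_next])
            # max() returns the first maximal index, so ties resolve to "left"
            best = max(range(len(ACTIONS)), key=lambda a: q[a])
            policy[s] = best
            new_value.append(q[best])
        value = new_value
    return policy


# Exact optimal values for this chain, and an uninformative all-zero oracle.
V_STAR = [GAMMA ** (GOAL - 1 - s) for s in range(GOAL)] + [0.0]
ZERO = [0.0] * N_STATES

print(truncated_horizon_policy(V_STAR, 1)[:GOAL])
print(truncated_horizon_policy(ZERO, 1)[:GOAL])
print(truncated_horizon_policy(ZERO, 5)[:GOAL])
```

In this toy setting, the optimal oracle makes a horizon of one sufficient: every non-goal state chooses "right". The all-zero oracle carries no information, so one-step planning only finds the goal from the adjacent state, and the planner needs a horizon of five before the leftmost state moves toward the goal. This mirrors, on a very small scale, the interpolation between imitation learning (short horizon, accurate oracle) and reinforcement learning (long horizon, uninformative oracle) described above.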

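Building on the concept of reward shaping, this work proposes a new way to combine imitation learning and reinforcement learning by fusing them through a near-optimal cost-to-go oracle. The resulting method, Truncated HORizon Policy Search (THOR), searches for policies that maximize the total reshaped reward over a finite planning horizon with respect to that oracle. Experiments show that THOR can outperform both reinforcement learning and imitation learning even when the oracle is not globally optimal.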

Truncated Horizon Policy Search: Combining Reinforcement Learning and Imitation Learning