While reinforcement learning (RL) has shown promising performance, its sample
complexity continues to be a substantial hurdle, restricting its broader
application across a variety of domains. Imitation learning (IL) utilizes
oracles to improve sample efficiency, yet it is often constrained by the
quality of the oracles deployed. which actively interleaves between IL and RL
based on an online estimate of their performance. RPI draws on the strengths of
IL, using oracle queries to facilitate exploration, an aspect that is notably
challenging in sparse-reward RL, particularly during the early stages of
learning. As learning unfolds, RPI gradually transitions to RL, effectively
treating the learned policy as an improved oracle. This algorithm is capable of
learning from and improving upon a diverse set of black-box oracles. Integral
to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient
(RPG), both of which reason over whether to perform state-wise imitation from
the oracles or learn from its own value function when the learner's performance
surpasses that of the oracles in a specific state. Empirical evaluations and
theoretical analysis validate that RPI excels in comparison to existing
state-of-the-art methodologies, demonstrating superior performance across
various benchmark domains.

该研究通过融合强化学习和模仿学习的方法，利用自适应的策略选择和梯度优化算法，在稀疏奖励场景下有效提高样本效率，并在多个基准领域中展现出卓越的性能。