In some reinforcement learning problems an agent may be provided with a set
of input policies, perhaps learned from prior experience or provided by
advisors. We present a reinforcement learning with policy advice (RLPA)
algorithm which leverages this input set and learns to use the best policy in
the set for the reinforcement learning task at hand. We prove that RLPA has a
sub-linear regret of $\tilde{O}(\sqrt{T})$ relative to the best input policy, and
that both this regret and its computational complexity are independent of the
size of the state and action spaces. Our empirical simulations support our
theoretical analysis. This suggests that RLPA may offer significant advantages
in large domains where some good prior policies are provided.

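Summary: This paper presents a reinforcement learning with policy advice (RLPA) algorithm that leverages a provided set of input policies and learns to use the best policy in the set for the reinforcement learning task at hand. We prove that the algorithm's regret relative to the best input policy is sub-linear, and that both this regret and the computational complexity are independent of the size of the state and action spaces. Our empirical simulations support our theoretical analysis. This suggests that RLPA may offer significant advantages in large domains where good prior policies are provided.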
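
The idea of learning which input policy to use can be illustrated with a minimal sketch. This is not the RLPA algorithm itself: it swaps in a standard UCB1 bandit rule over the input policies, with a fixed episode length, on a made-up two-state deterministic MDP. All names here (`run_policy`, `select_among_policies`, `next_state`, `rewards`, `episode_len`) are hypothetical and chosen only for illustration. Note that the learner's bookkeeping grows with the number of input policies, not with the number of states or actions.

```python
import math

def run_policy(policy, next_state, rewards, state, steps):
    """Follow a fixed policy for `steps` steps; return (total reward, final state)."""
    total = 0.0
    for _ in range(steps):
        action = policy[state]
        total += rewards[(state, action)]
        state = next_state[(state, action)]
    return total, state

def select_among_policies(policies, next_state, rewards, horizon, episode_len=20):
    """UCB1 over input policies (an assumed stand-in, not the paper's method)."""
    n = len(policies)
    counts = [0] * n    # episodes run with each policy
    means = [0.0] * n   # empirical per-step reward of each policy
    state, t, total = 0, 0, 0.0
    while t < horizon:
        steps = min(episode_len, horizon - t)
        untried = [i for i in range(n) if counts[i] == 0]
        if untried:
            i = untried[0]  # try every policy once first
        else:
            episodes = sum(counts)
            i = max(
                range(n),
                key=lambda k: means[k] + math.sqrt(2 * math.log(episodes) / counts[k]),
            )
        ep_reward, state = run_policy(policies[i], next_state, rewards, state, steps)
        counts[i] += 1
        means[i] += (ep_reward / steps - means[i]) / counts[i]  # running average
        total += ep_reward
        t += steps
    return counts, means, total

# Toy deterministic MDP (illustrative assumption): two states, two actions.
next_state = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
rewards = {(0, 0): 0.2, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}

# Three input policies, each mapping state -> action.
policies = [
    {0: 0, 1: 0},  # settles in state 0, about 0.2 reward per step
    {0: 1, 1: 1},  # settles in state 1, about 1.0 reward per step
    {0: 1, 1: 0},  # alternates between states, 0 reward per step
]

counts, means, total = select_among_policies(policies, next_state, rewards, horizon=2000)
best = counts.index(max(counts))
```

In this toy setting the second policy ends up being run in most episodes, so the accumulated reward approaches what that policy alone would have collected. The gap between the two is the quantity that a regret bound relative to the best input policy controls.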