A long-standing problem in model free reinforcement learning (RL) is that it requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method to increase the sample efficiency of RL when we have access to demonstrations. Our approach, which we call Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment's fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards during the course of training until we reach the initial state. We perform experiments in a competitive four player game (Pommerman) and a path-finding maze game. We find that this weak form of guidance provides significant gains in sample complexity with a stark advantage in sparse reward environments. In some cases, standard RL did not yield any improvement while Backplay reached success rates greater than 50% and generalized to unseen initial conditions in the same amount of training time. Additionally, we see that agents trained via Backplay can learn policies superior to those of the original demonstration.

提高样本效率是模型自由强化学习中的一个挑战，本文提出了一种名为Backplay的方法，利用单个演示构建任务的课程并以该演示的末端为起点进行训练，最终在可竞争方法中优化训练速度。

回放：必须不停地倒转