Recent advances in deep reinforcement learning have shown great potential for solving challenging real-world problems, including the game of Go and robotic applications. These algorithms usually need a carefully designed reward function to guide training at each time step. In the real world, however, designing such a reward function is non-trivial, and the only available signal is often obtained at the end of a trajectory, known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment that uses deep neural networks to learn to decompose the episodic return back to each time step in the trajectory. With this learned reward signal, the learning efficiency of episodic reinforcement learning can be substantially improved. In particular, we find that expressive language models such as the Transformer can be adopted to learn the importance and dependency of states in the trajectory, thereby providing high-quality and interpretable learned reward signals. We have performed extensive experiments on a set of MuJoCo continuous locomotion control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.
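
The idea of decomposing an episodic return into per-step rewards can be illustrated with a minimal sketch. It assumes a simple linear reward model standing in for the Transformer described above, trained so that the sum of its per-step predictions matches the observed episodic return. The function names, the feature representation, and the hyperparameters are all hypothetical and are not taken from the paper.

```python
import random

def predict_step_rewards(weights, trajectory):
    """Return one estimated reward per step.

    Each step is a feature vector for a (state, action) pair; the linear
    model here is an assumption made for illustration only.
    """
    return [sum(w * x for w, x in zip(weights, step)) for step in trajectory]

def fit_return_decomposition(trajectories, returns, dim, lr=0.01, epochs=200, seed=0):
    """Fit weights so summed step rewards match each episodic return.

    Minimizes 0.5 * (sum_t r_hat_t - R)^2 per trajectory with plain SGD.
    """
    rng = random.Random(seed)
    weights = [0.0] * dim
    order = list(range(len(trajectories)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            traj = trajectories[i]
            error = sum(predict_step_rewards(weights, traj)) - returns[i]
            # For a linear model, the gradient is error * (sum of step features).
            feature_sum = [sum(step[k] for step in traj) for k in range(dim)]
            weights = [w - lr * error * g for w, g in zip(weights, feature_sum)]
    return weights
```

Once fitted, predict_step_rewards(weights, trajectory) yields a dense per-step signal that a standard reinforcement learning algorithm could consume in place of the single episodic return.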

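This paper introduces a new temporal credit assignment algorithm that uses deep neural networks to decompose the episodic return back to each time step, and adopts a Transformer language model to learn the importance and dependency of states in the trajectory, which can substantially improve the learning efficiency of episodic reinforcement learning. The authors performed extensive experiments on a set of MuJoCo continuous locomotion control tasks and demonstrated the effectiveness of the algorithm.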

Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning