Providing densely shaped reward functions for RL algorithms is often
exceedingly challenging, motivating the development of RL algorithms that can
learn from easier-to-specify sparse reward functions. This sparsity poses new
exploration challenges. One common way to address this problem is using
demonstrations to provide initial signal about regions of the state space with
high rewards. However, prior RL-from-demonstrations algorithms introduce
significant complexity and many hyperparameters, making them hard to implement
and tune. We introduce Monte Carlo Augmented Actor-Critic (MCAC), a
parameter-free modification to standard actor-critic algorithms that initializes the
replay buffer with demonstrations and computes a modified $Q$-value by taking
the maximum of the standard temporal difference (TD) target and a Monte Carlo
estimate of the reward-to-go. This promotes exploration in the neighborhood
of high-performing trajectories by encouraging high $Q$-values in corresponding
regions of the state space. Experiments across $5$ continuous control domains
suggest that MCAC can be used to significantly increase learning efficiency
across $6$ commonly used RL and RL-from-demonstrations algorithms. See
this https URL for code and supplementary material.
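
The modified target described above can be illustrated with a minimal sketch. It assumes a discounted reward-to-go, an episode that ends in a terminal state, and a critic that supplies next-state $Q$-values; the function names (reward_to_go, mcac_target), the discount factor, and the toy episode are hypothetical choices for illustration and are not taken from the authors' code.

```python
import numpy as np

def reward_to_go(rewards, gamma):
    """Discounted Monte Carlo return from each step to the end of the episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Accumulate backwards so each step sees the discounted future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mcac_target(rewards, next_q, dones, mc_returns, gamma):
    """Elementwise maximum of the one-step TD target and the Monte Carlo estimate."""
    td_target = rewards + gamma * (1.0 - dones) * next_q
    return np.maximum(td_target, mc_returns)

# Toy sparse-reward demonstration: reward arrives only at the final step.
rewards = np.array([0.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])
next_q = np.zeros(3)  # assumption: an untrained critic that predicts 0 everywhere
gamma = 0.5           # assumption: illustrative discount factor

mc = reward_to_go(rewards, gamma)
targets = mcac_target(rewards, next_q, dones, mc, gamma)
print(targets)
```

In this toy episode the TD target alone is zero for the first two steps because the critic has not yet learned anything, while the Monte Carlo estimate already carries the final reward back to the early states, so the maximum raises the target there.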

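Specifying densely shaped reward functions for RL algorithms is often very challenging, which motivates the development of RL algorithms that can learn from easier-to-specify sparse reward functions. To address the new exploration challenges that reward sparsity introduces, we introduce Monte Carlo Augmented Actor-Critic (MCAC) and find that it can significantly improve learning efficiency.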

Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations

Deep Reinforcement Learning has shown tremendous success in solving several
games and tasks in robotics. However, unlike humans, it generally requires a
lot of training instances. Trajectories that demonstrate how to solve the task at
hand can help increase the sample efficiency of deep RL methods. In this paper, we
present a simple approach to use such trajectories, applied to the challenging
Ball-in-Maze Games, recently introduced in the literature. We show that in
spite of not using human-generated trajectories and just using the simulator as
a model to generate a limited number of trajectories, we can get a speed-up of
about 2-3x in the learning process. We also discuss some challenges we observed
while using trajectory-based learning for very sparse reward functions.

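This work presents a simple approach to using trajectories to increase the sample efficiency of deep reinforcement learning methods, applied to the challenging Ball-in-Maze Games recently introduced in the literature. It shows that using the simulator as a model to generate a limited number of trajectories, without any human-generated trajectories, yields a speed-up of about 2-3x in learning. It also discusses some challenges encountered when using trajectory-based learning with very sparse reward functions.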