Providing densely shaped reward functions for RL algorithms is often
exceedingly challenging, motivating the development of RL algorithms that can
learn from easier-to-specify sparse reward functions. This sparsity poses new
exploration challenges. One common way to address this problem is using
demonstrations to provide initial signal about regions of the state space with
high rewards. However, prior RL-from-demonstrations algorithms introduce
significant complexity and many hyperparameters, making them hard to implement
and tune. We introduce Monte Carlo Augmented Actor Critic (MCAC), a
parameter-free modification to standard actor-critic algorithms which
initializes the replay buffer with demonstrations and computes a modified
$Q$-value by taking the maximum of the standard temporal difference (TD)
target and a Monte Carlo
estimate of the reward-to-go. This encourages exploration in the neighborhood
of high-performing trajectories by encouraging high $Q$-values in corresponding
regions of the state space. Experiments across $5$ continuous control domains
suggest that MCAC can be used to significantly increase learning efficiency
across $6$ commonly used RL and RL-from-demonstrations algorithms. See
this https URL for code and supplementary material.
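
The modified target described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`reward_to_go`, `mcac_target`), the discount factor `gamma`, and the treatment of each trajectory as a finite episode are assumptions made for illustration only.

```python
import numpy as np


def reward_to_go(rewards, gamma):
    """Discounted Monte Carlo return from each step to the end of one episode.

    Assumes a finite episode; `gamma` is a hypothetical discount factor.
    """
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    # Accumulate backwards so each entry sums the discounted future rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mcac_target(rewards, next_q, dones, mc_returns, gamma):
    """Elementwise maximum of the one-step TD target and the Monte Carlo return."""
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # Standard TD target: bootstrap from the next-state value unless terminal.
    td_target = rewards + gamma * (1.0 - dones) * next_q
    return np.maximum(td_target, np.asarray(mc_returns, dtype=float))


# Sparse-reward episode: the only reward arrives on the final step.
rewards = [0.0, 0.0, 1.0]
dones = [0.0, 0.0, 1.0]
gamma = 0.5
mc = reward_to_go(rewards, gamma)
# With an uninformative critic (all zeros), the Monte Carlo term supplies signal.
targets = mcac_target(rewards, np.zeros(3), dones, mc, gamma)
```

In this toy episode the TD target is zero everywhere except the last step, so early in training the Monte Carlo term is what lifts the targets along the successful trajectory. Once the critic's bootstrapped estimate exceeds the observed return, the maximum falls back to the ordinary TD target.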

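Providing densely shaped reward functions for RL algorithms is often very challenging, which motivates the development of RL algorithms that can learn from easier-to-specify sparse reward functions. To address the new exploration challenges introduced by reward sparsity, we introduce Monte Carlo Augmented Actor Critic (MCAC) and find that it can significantly improve learning efficiency.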