Data selection is essential for any data-based optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate the uncertainty of the Q-Value estimate. Consequently, they cannot adapt the sampling strategy, including the balance between exploring and exploiting transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. The strategy relies on an uncertainty estimate of the Q-Value function, which guides the sampling to explore more significant transitions and thus learn a more efficient policy. Experiments on classical control environments demonstrate stable results across various environments. They show that, for dense rewards, the proposed method outperforms state-of-the-art sampling strategies in convergence and peak performance by 26% on average.
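
To make the idea of uncertainty-guided buffer sampling concrete, the following is a minimal sketch and not the formulation used in this paper. It assumes that each stored transition comes with several Q-Value estimates (for example from a hypothetical ensemble of Q-heads), uses their mean as the exploitation term and their standard deviation as the exploration term, and turns the combined score into sampling probabilities with a softmax. The function names, the additive score, and the `exploration_weight` parameter are illustrative assumptions.

```python
import math
import random
import statistics


def sampling_weights(q_estimates, exploration_weight=1.0):
    """Turn per-transition Q-Value estimates into sampling probabilities.

    q_estimates: one list of Q-Value estimates per transition
    (assumed to come from an ensemble; this is an illustrative choice).
    """
    # Exploitation term: mean Q-Value estimate of each transition.
    means = [statistics.fmean(q) for q in q_estimates]
    # Exploration term: disagreement between the estimates (uncertainty).
    stds = [statistics.pstdev(q) for q in q_estimates]
    # Assumed additive trade-off between the two terms.
    scores = [m + exploration_weight * s for m, s in zip(means, stds)]
    # Numerically stable softmax so the weights form a distribution.
    top = max(scores)
    exp_scores = [math.exp(s - top) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


def sample_batch(buffer, q_estimates, batch_size, rng=None,
                 exploration_weight=1.0):
    """Draw a batch of transitions (with replacement) using the weights."""
    rng = rng or random.Random()
    weights = sampling_weights(q_estimates, exploration_weight)
    return rng.choices(buffer, weights=weights, k=batch_size)
```

In this sketch, setting `exploration_weight` to 0 ranks transitions purely by their mean Q-Value estimate, while larger values shift sampling toward transitions whose estimates disagree most.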

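This paper proposes a new sampling strategy based on an uncertainty estimate of the Q-Value function, which guides the sampling to explore more significant transitions and thus learn a more effective policy. Experiments show that, across various environments, the method outperforms existing strategies in convergence and peak performance by 26% on average.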

MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer Sampling