Off-policy reinforcement learning (RL) using a fixed offline dataset of
logged interactions is an important consideration in real-world applications.
This paper studies offline RL using the DQN replay dataset comprising the
entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate
that recent off-policy deep RL algorithms, even when trained solely on this
fixed dataset, outperform the fully trained DQN agent. To enhance
generalization in the offline setting, we present Random Ensemble Mixture
(REM), a robust Q-learning algorithm that enforces optimal Bellman consistency
on random convex combinations of multiple Q-value estimates. Offline REM
trained on the DQN replay dataset surpasses strong RL baselines. Ablation
studies highlight the role of offline dataset size and diversity as well as the
algorithm choice in our positive results. Overall, the results here present an
optimistic view that robust RL algorithms trained on sufficiently large and
diverse offline datasets can lead to high-quality policies. The DQN replay
dataset can serve as an offline RL benchmark and is open-sourced.
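
As a rough illustration of the REM idea described above, the following NumPy
sketch mixes several Q-value heads with random convex weights and computes the
temporal-difference error of the mixture against a Bellman optimality target.
The function names, array shapes, normalized-uniform weight sampling, Huber
loss, and discount value are illustrative assumptions, not details stated in
this abstract.

```python
import numpy as np


def sample_convex_weights(num_heads, rng):
    """Draw random non-negative weights that sum to 1 (assumed sampling scheme)."""
    raw = rng.uniform(size=num_heads)
    return raw / raw.sum()


def rem_td_errors(q_heads, target_q_heads, actions, rewards, dones, alpha,
                  gamma=0.99):
    """TD errors for a convex mixture of Q-value heads.

    q_heads, target_q_heads: float arrays, shape (num_heads, batch, num_actions)
    actions: int array, shape (batch,)
    rewards, dones: float arrays, shape (batch,); dones is 1.0 at episode end
    alpha: float array, shape (num_heads,), a point on the probability simplex
    """
    # Convex combination over the head axis -> shape (batch, num_actions).
    q_mix = np.tensordot(alpha, q_heads, axes=1)
    target_mix = np.tensordot(alpha, target_q_heads, axes=1)

    # Q-value of the logged action under the mixed estimate.
    batch_idx = np.arange(actions.shape[0])
    q_taken = q_mix[batch_idx, actions]

    # Bellman optimality target: greedy max over the mixed target estimate.
    bootstrap = target_mix.max(axis=1)
    targets = rewards + gamma * (1.0 - dones) * bootstrap
    return q_taken - targets


def huber(x, delta=1.0):
    """Huber loss, a common choice for Q-learning (an assumption here)."""
    abs_x = np.abs(x)
    return np.where(abs_x <= delta, 0.5 * x ** 2, delta * (abs_x - 0.5 * delta))
```

In a training loop built on this sketch, one would presumably resample
`alpha` for every mini-batch drawn from the fixed dataset and minimize the
mean of `huber(rem_td_errors(...))` with respect to the parameters of the
online heads, while the target heads are held fixed between periodic updates.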

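This study investigates offline reinforcement learning using the DQN replay
dataset and proposes the Random Ensemble Mixture (REM) algorithm to improve
generalization, obtaining better results than a fully trained DQN agent. This
suggests that robust RL algorithms trained on sufficiently large and diverse
offline datasets can yield high-quality policies.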