Deep Reinforcement Learning has yielded proficient controllers for complex tasks. However, these controllers have limited memory and rely on being able to perceive the complete game screen at each decision point. To address these shortcomings, this article investigates the effects of adding recurrency to a Deep Q-Network (DQN) by replacing the first post-convolutional fully-connected layer with a recurrent LSTM. The resulting Deep Recurrent Q-Network (DRQN) exhibits similar performance on standard Atari 2600 MDPs but better performance on equivalent partially observed domains featuring flickering game screens. Results indicate that given the same length of history, recurrency allows partial information to be integrated through time and is superior to alternatives such as stacking a history of frames in the network's input layer. We additionally show that when trained with partial observations, DRQN's performance at evaluation time scales as a function of observability. Similarly, when trained with full observations and evaluated with partial observations, DRQN's performance degrades more gracefully than that of DQN. We therefore conclude that when dealing with partially observed domains, the use of recurrency confers tangible benefits.
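To make the "flickering game screens" setting concrete, the following is a minimal sketch of how such partial observability could be simulated. The abstract does not specify the mechanism, so everything here is an assumption for illustration: the helper names `flicker` and `observed_fraction`, the choice to blank an obscured frame to zeros, and the default obscuring probability of 0.5.

```python
import random


def flicker(frame, p_obscure=0.5, rng=random):
    """Return the frame unchanged, or an all-zero frame with probability p_obscure.

    Assumption: an obscured screen is modeled as a frame of zeros.
    """
    if rng.random() < p_obscure:
        return [[0 for _ in row] for row in frame]
    return frame


def observed_fraction(num_steps, p_obscure, seed=0):
    """Estimate how often the agent actually sees the true (non-blank) frame."""
    rng = random.Random(seed)
    frame = [[1, 2], [3, 4]]  # hypothetical tiny non-zero "screen"
    visible = sum(flicker(frame, p_obscure, rng) == frame for _ in range(num_steps))
    return visible / num_steps


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9):
        print(p, observed_fraction(10_000, p))
```

Under this sketch, any single observation may carry no information at all, so a controller has to combine evidence across several timesteps to estimate the underlying game state.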

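This paper introduces a new deep reinforcement learning model, the Deep Recurrent Q-Network (DRQN), which replaces the first post-convolutional fully-connected layer of DQN with a recurrent LSTM. DRQN sees only a single frame at each decision point, yet it successfully integrates information through time. It matches DQN's performance on standard Atari games and on partially observed variants, and its performance scales with the degree of observability. Recurrency is therefore a viable alternative to stacking a history of frames in DQN's input layer.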
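The architectural change described here, swapping the first post-convolutional fully-connected layer for an LSTM, can be sketched in PyTorch as follows. This is an illustrative sketch and not the authors' implementation. The class name `DRQNSketch`, the 84x84 single-channel input, the convolution sizes, the 512-unit LSTM, and the default of 18 actions are all assumptions chosen to resemble a typical DQN-style network; none of them are stated in the text above.

```python
import torch
import torch.nn as nn


class DRQNSketch(nn.Module):
    """Convolutional encoder, then an LSTM in place of the first FC layer."""

    def __init__(self, num_actions=18, hidden_size=512):
        super().__init__()
        # Assumed DQN-style convolutional stack over one 84x84 frame.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 84 -> 20 -> 9 -> 7 spatially, so the encoder emits 64 * 7 * 7 features.
        self.lstm = nn.LSTM(64 * 7 * 7, hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames, state=None):
        # frames: (batch, time, 1, 84, 84), one frame per timestep.
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:]))
        out, state = self.lstm(feats.reshape(b, t, -1), state)
        # Q-values for every timestep, plus the recurrent state to carry forward.
        return self.q_head(out), state


if __name__ == "__main__":
    net = DRQNSketch(num_actions=4)
    q_values, state = net(torch.zeros(2, 3, 1, 84, 84))
    print(q_values.shape)
```

At decision time the returned `state` would be passed back in with the next single frame, which is how a network of this shape can accumulate information over time without a stacked frame history in its input.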

Deep Recurrent Q-Learning for Partially Observable MDPs