Deep reinforcement learning algorithms that estimate state and state-action
value functions have been shown to be effective in a variety of challenging
domains, including learning control strategies from raw image pixels. However,
algorithms that estimate state and state-action value functions typically
assume a fully observed state and must compensate for partial observations by
using finite length observation histories or recurrent networks. In this work,
we propose a new deep reinforcement learning algorithm based on counterfactual
regret minimization that iteratively updates an approximation to an
advantage-like function and is robust to partially observed state. We
demonstrate that this new algorithm can substantially outperform strong
baseline methods on several partially observed reinforcement learning tasks:
learning first-person 3D navigation in Doom and Minecraft, and acting in the
presence of partially observed objects in Doom and Pong.

本研究提出了一种新的基于反事实遗憾最小化的深度强化学习算法，能够有效处理部分观测状态，并在 Doom 和 Minecraft 中的学习第一人称的 3D 导航以及在 Doom 和 Pong 中进行部分观测对象的动作等强化学习任务中显著优于现有基线算法。