In this paper, we take a methodological approach to the problem of cumulative reward overestimation in deep reinforcement learning. We generalise notions from information-theoretic bounded rationality to handle high-dimensional state spaces efficiently. The resulting algorithm covers a wide range of learning outcomes, which can be obtained by tuning a Lagrange multiplier that intrinsically penalises rewards. We show that deep Q-networks arise naturally as a special case of our proposed approach. We further contribute a novel scheduling scheme for bounded-rational behaviour that ensures sample efficiency and robustness. In experiments on Atari games, we show that our algorithm outperforms various deep reinforcement learning algorithms (e.g., deep and double deep Q-networks) in terms of both game-play performance and sample complexity.
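
The abstract does not state the value operator explicitly. As a minimal sketch, assuming the free-energy form commonly used in information-theoretic bounded rationality, (1/λ) log Σ_a ρ(a) exp(λ Q(s, a)), the following illustrates how a single Lagrange multiplier λ could interpolate between a prior-averaged value and the hard maximum used by deep Q-networks. The function name `bounded_rational_value`, the prior `prior`, and the parameter `lam` are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import math


def bounded_rational_value(q_values, prior, lam):
    """Illustrative soft value: (1/lam) * log sum_a prior[a] * exp(lam * q_values[a]).

    Assumptions (not from the source): the prior is strictly positive and
    sums to one, and lam >= 0.
    """
    if lam == 0:
        # Limit lam -> 0: the value is the expectation of Q under the prior.
        return sum(p * q for p, q in zip(prior, q_values))
    # Shift by the maximum Q-value so the exponentials cannot overflow.
    q_max = max(q_values)
    weighted = sum(p * math.exp(lam * (q - q_max)) for p, q in zip(prior, q_values))
    return q_max + math.log(weighted) / lam


q = [1.0, 2.0, 3.0]
uniform_prior = [1.0 / 3.0] * 3

# Small lam stays close to the prior average; large lam approaches max(q),
# which is the hard-max target used in standard Q-learning.
for lam in (0.0, 1.0, 10.0, 1e6):
    print(lam, bounded_rational_value(q, uniform_prior, lam))
```

Under these assumptions the value never exceeds the maximum Q-value for any finite λ, which is one way to read the claim that the multiplier penalises rewards and counteracts overestimation.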

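This paper presents a deep reinforcement learning method that draws on concepts from information theory and introduces an intrinsic penalty signal to encourage lower Q-value estimates. To ensure efficient and robust learning, it also proposes a novel scheduling scheme for the Lagrange multiplier. Experiments on Atari show that the algorithm outperforms other algorithms (e.g., deep and double deep Q-networks) in both game-play performance and sample complexity, and these results continue to hold under the recently proposed Dueling architecture.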

An Information-Theoretic Optimality Principle for Deep Reinforcement Learning