We propose a new method to study the internal memory used by reinforcement
learning policies. We quantify the amount of relevant past information by
estimating the mutual information between behavior histories and the current action
of an agent. We perform this estimation in the passive setting, that is, we do
not intervene but merely observe the natural behavior of the agent. Moreover,
we provide a theoretical justification for our approach by showing that it
yields an implementation-independent lower bound on the minimal memory capacity
of any agent that implements the observed policy. We demonstrate our approach by
estimating the memory use of DQN policies on concatenated Atari frames,
which reveals sharply different memory use across 49 games. The study of
memory as information that flows from the past to the current action opens
avenues to understand and improve successful reinforcement learning algorithms.

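This work proposes a new method to study the internal memory used by reinforcement learning policies: the amount of relevant past information is estimated via the mutual information between behavior histories and the agent's current action, and the estimation is performed in the passive setting. In addition, the approach is justified theoretically by showing that it yields an implementation-independent lower bound on the minimal memory capacity. The authors evaluate DQN policies on Atari games and show differing memory use across 49 games.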
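
As an illustration of the passive estimation idea, the following minimal sketch computes a plug-in (count-based) estimate of the conditional mutual information I(A_t; O_{t-1} | O_t) from logged behavior alone. It is not the estimator used in this work, which the abstract does not specify. The function name `conditional_mutual_information`, the one-step history, the binary observations, and the two toy policies are all assumptions made for the example. A policy that copies the previous observation needs one bit of memory, so its estimate should be close to one bit, while a purely reactive policy should give zero.

```python
import math
import random
from collections import Counter


def conditional_mutual_information(samples):
    """Plug-in estimate of I(A; H | O) in bits.

    `samples` is a list of (history, observation, action) triples with
    hashable, discrete entries. This is an illustrative estimator only.
    """
    n = len(samples)
    c_hoa = Counter(samples)
    c_ho = Counter((h, o) for h, o, _ in samples)
    c_oa = Counter((o, a) for _, o, a in samples)
    c_o = Counter(o for _, o, _ in samples)

    total = 0.0
    for (h, o, a), count in c_hoa.items():
        # p(h,o,a) * log2[ p(h,o,a) p(o) / (p(h,o) p(o,a)) ], written with counts
        ratio = count * c_o[o] / (c_ho[(h, o)] * c_oa[(o, a)])
        total += (count / n) * math.log2(ratio)
    return total


# Hypothetical toy environment: observations are independent fair coin flips.
rng = random.Random(0)
obs = [rng.randint(0, 1) for _ in range(20001)]

# Passive logs as (previous observation, current observation, action) triples.
# Toy policy with memory: the action repeats the previous observation.
memory_samples = [(obs[t - 1], obs[t], obs[t - 1]) for t in range(1, len(obs))]
# Toy reactive policy: the action depends only on the current observation.
reactive_samples = [(obs[t - 1], obs[t], obs[t]) for t in range(1, len(obs))]

mi_memory = conditional_mutual_information(memory_samples)
mi_reactive = conditional_mutual_information(reactive_samples)

print(f"policy with memory: {mi_memory:.3f} bits")
print(f"reactive policy:    {mi_reactive:.3f} bits")
```

Conditioning on the current observation is a design choice of this sketch: it isolates the information about the past that the action carries beyond what is currently visible. Plug-in estimates of this kind are biased on finite samples, so high-dimensional inputs such as Atari frames would call for a different estimator.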