Real-world sequential decision making problems commonly involve partial observability, which requires the agent to maintain a memory of history in order to infer the latent states, plan and make good decisions. Coping with partial observability in general is extremely challenging, as a number of worst-case statistical and computational barriers are known in learning Partially Observable Markov Decision Processes (POMDPs). Motivated by the problem structure in several physical applications, as well as a commonly used technique known as "frame stacking", this paper proposes to study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$. We establish a set of upper and lower bounds on the sample complexity for learning near-optimal policies for this class of problems in both tabular and rich-observation settings (where the number of observations is enormous). In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon, and is independent of the number of observations. Our results show that a short-term memory suffices for reinforcement learning in these environments.
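
To make the "frame stacking" idea concrete, here is a minimal Python sketch of a length-$m$ memory that an agent could condition on instead of the full history. It is only an illustration and not the paper's algorithm; the class name `ShortMemory` and its methods are hypothetical, and it assumes the window interleaves the last $m$ observations with the $m-1$ actions taken between them.

```python
from collections import deque
from typing import Deque, Hashable, Tuple


class ShortMemory:
    """Hypothetical helper: keeps the last m observations and the
    m - 1 actions taken between them (an assumed window layout)."""

    def __init__(self, m: int) -> None:
        if m < 1:
            raise ValueError("m must be at least 1")
        self.m = m
        # Bounded deques silently drop the oldest entry once full.
        self._obs: Deque[Hashable] = deque(maxlen=m)
        self._acts: Deque[Hashable] = deque(maxlen=m - 1)

    def reset(self, first_obs: Hashable) -> None:
        """Start a new episode with its first observation."""
        self._obs.clear()
        self._acts.clear()
        self._obs.append(first_obs)

    def step(self, action: Hashable, next_obs: Hashable) -> None:
        """Record the action just taken and the observation that followed."""
        self._acts.append(action)
        self._obs.append(next_obs)

    def window(self) -> Tuple[Hashable, ...]:
        """Return (o, a, o, a, ..., o) over the most recent steps."""
        obs = list(self._obs)
        acts = list(self._acts)
        out = []
        for i in range(len(obs) - 1):
            out.append(obs[i])
            out.append(acts[i])
        out.append(obs[-1])
        return tuple(out)


memory = ShortMemory(m=2)
memory.reset("o1")
memory.step("a1", "o2")
memory.step("a2", "o3")
recent = memory.window()  # the oldest pair ("o1", "a1") has been dropped
```

In an $m$-step decodable problem, a policy that maps `window()` to an action sees enough to recover the latent state, so the agent never needs to store more than this window.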

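This paper studies how to learn partially observable Markov decision processes (POMDPs). It considers a special subclass of POMDPs whose latent states can be decoded from the most recent history of a short length $m$. We establish a set of upper and lower bounds on the sample complexity of learning near-optimal policies for this class of problems in both the tabular and rich-observation settings, using a novel "moment matching" approach for the rich-observation case, and show that a short-term memory suffices for reinforcement learning in these environments.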

Provable Reinforcement Learning with a Short-Term Memory