We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have previously been employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging, since the learner interacts with the environment and may thereby change future observations. We devise a learning algorithm that runs in episodes; in each episode, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy, which maximizes the expected reward under the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy, and efficient scaling with respect to the dimensionality of the observation and action spaces.
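To make the planning step concrete, the following is a minimal sketch of what an optimization oracle over memoryless policies could look like on a tiny model. It is not the paper's implementation. The names `planning_oracle`, `average_reward`, and `stationary_distribution`, the array layouts `T[a, x, x']`, `O[x, y]`, and `R[x, a]`, and the toy model are all assumptions made for illustration. The sketch also assumes an ergodic induced chain and restricts the search to deterministic memoryless policies, which is a simplification since the best memoryless policy can in general be stochastic. In the episodic scheme described above, the inputs would be the parameters estimated by the spectral step, which is not shown here.

```python
import itertools
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P (assumed ergodic)."""
    n = P.shape[0]
    # Stack the equations pi (P - I) = 0 with the normalization sum(pi) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def average_reward(T, O, R, policy):
    """Long-run average reward of a deterministic memoryless policy.

    T[a, x, x2]: transition probabilities, O[x, y]: observation probabilities,
    R[x, a]: expected reward, policy[y]: action taken on observation y.
    """
    policy = np.asarray(policy)
    # State chain induced by the policy: P[x, x2] = sum_y O[x, y] T[policy[y], x, x2].
    P = np.einsum("xy,yxz->xz", O, T[policy])
    # Expected one-step reward per hidden state: r[x] = sum_y O[x, y] R[x, policy[y]].
    r = np.einsum("xy,xy->x", O, R[:, policy])
    return float(stationary_distribution(P) @ r)

def planning_oracle(T, O, R):
    """Brute-force search over all deterministic observation-to-action maps."""
    n_actions, n_obs = T.shape[0], O.shape[1]
    candidates = itertools.product(range(n_actions), repeat=n_obs)
    best = max(candidates, key=lambda p: average_reward(T, O, R, p))
    return best, average_reward(T, O, R, best)

# Hypothetical toy model: 2 hidden states, 2 observations, 2 actions.
T_toy = np.full((2, 2, 2), 0.5)   # every action moves to a uniformly random state
O_toy = np.eye(2)                 # observations reveal the state exactly
R_toy = np.eye(2)                 # reward 1 when the action index matches the state
best_policy, best_value = planning_oracle(T_toy, O_toy, R_toy)
print(best_policy, best_value)
```

The exhaustive search enumerates every observation-to-action map, so it is only feasible for very small observation and action sets; it is meant to show what the oracle computes, not how to compute it at scale.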

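Summary: A new reinforcement learning algorithm is proposed for partially observable Markov decision processes (POMDPs), based on spectral decomposition methods. The algorithm learns the model parameters from trajectories generated by a fixed policy, uses an optimization oracle to return the optimal memoryless planning policy, and scales efficiently with the dimensionality of the observation and action spaces.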

Reinforcement Learning of POMDPs Using Spectral Methods