Reactive (memoryless) policies are sufficient in completely observable Markov decision processes (MDPs), but some kind of memory is usually necessary for optimal control of a partially observable MDP. Policies with finite memory can be represented as finite-state automata. In this paper, we extend Baird and Moore's VAPS algorithm to the problem of learning general finite-state automata. Because it performs stochastic gradient descent, this algorithm can be shown to converge to a locally optimal finite-state controller. We provide the details of the algorithm and then consider the question of under what conditions stochastic gradient descent will outperform exact gradient descent. We conclude with empirical results comparing the performance of stochastic and exact gradient descent, and showing the ability of our algorithm to extract the useful information contained in the sequence of past observations to compensate for the lack of observability at each time-step.

本文介绍了使用有限状态自动机表示具有有限记忆的策略学习算法，具体探讨在部分可观测的MDP问题中，基于随机梯度下降的VAPS算法进行本地优化的通用有限状态自动机控制器的问题。并进一步讨论了在何种条件下随机梯度下降将优于精确梯度下降的问题，通过实证研究验证了该算法在补偿每个时间步上的不可观测性方面发挥了良好的效果。

学习部分可观测环境的有限状态控制器