Gradient-based approaches to direct policy search in reinforcement learning
have received much recent attention as a means to solve problems of partial
observability and to avoid some of the problems associated with policy
degradation in value-function methods. In this paper we introduce GPOMDP, a
simulation-based algorithm for generating a {\em biased} estimate of the
gradient of the {\em average reward} in Partially Observable Markov Decision
Processes (POMDPs) controlled by parameterized stochastic policies. A similar
algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The
algorithm's chief advantages are that it requires storage of only twice the
number of policy parameters, uses one free parameter $\beta\in [0,1)$ (which
has a natural interpretation in terms of bias-variance trade-off), and requires
no knowledge of the underlying state. We prove convergence of GPOMDP, and show
how the correct choice of the parameter $\beta$ is related to the {\em mixing
time} of the controlled POMDP. We briefly describe extensions of GPOMDP to
controlled Markov chains, continuous state, observation and control spaces,
multiple agents, higher-order derivatives, and a version for training
stochastic policies with internal states. In a companion paper (Baxter,
Bartlett, & Weaver, 2001) we show how the gradient estimates generated by
GPOMDP can be used in both a traditional stochastic gradient algorithm and a
conjugate-gradient procedure to find local optima of the average reward.

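In summary, this paper presents GPOMDP, a simulation-based algorithm that generates a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies.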
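To make the storage and the role of $\beta$ concrete, the following is a minimal sketch of a GPOMDP-style estimator on a toy problem. It is an illustration under stated assumptions, not the paper's implementation: the two-state dynamics, the reward vector, the softmax policy, and the name `gpomdp` are all hypothetical choices made for this example. The estimator keeps only two vectors of the same length as the parameter vector, an eligibility trace `z` discounted by $\beta$ and a running average `delta`, and it never reads the underlying state except through the reward.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of floats."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def gpomdp(theta, beta, steps, rng):
    """Sketch of a GPOMDP-style gradient estimate on a toy problem.

    Toy assumptions (not from the paper): a single uninformative
    observation, two actions, action u moves the system to state u,
    and state 0 pays reward 1 while state 1 pays reward 0.
    """
    n = len(theta)
    z = [0.0] * n          # eligibility trace, one entry per parameter
    delta = [0.0] * n      # running gradient estimate, one entry per parameter
    rewards = [1.0, 0.0]   # reward attached to the next state (toy assumption)

    for t in range(steps):
        probs = softmax(theta)
        u = 0 if rng.random() < probs[0] else 1
        # Score function of a softmax policy: grad log mu_u = e_u - probs.
        score = [(1.0 if k == u else 0.0) - probs[k] for k in range(n)]
        next_state = u     # toy dynamics: the action selects the next state
        r = rewards[next_state]
        # Discounted trace update, then incremental average of r * z.
        z = [beta * z[k] + score[k] for k in range(n)]
        delta = [delta[k] + (r * z[k] - delta[k]) / (t + 1) for k in range(n)]
    return delta
```

Calling `gpomdp([0.0, 0.0], 0.0, 20000, random.Random(0))` returns an estimate that should lie near the exact gradient of this toy problem, which is $(0.25, -0.25)$ at $\theta = (0, 0)$. In this toy the next state depends only on the current action, so the chain mixes in a single step and $\beta = 0$ already gives an unbiased estimate. Raising $\beta$ here only adds variance, whereas a slowly mixing system would need $\beta$ closer to 1 to keep the bias small.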