Policy learning for partially observed control tasks requires policies that
can remember salient information from past observations. In this paper, we
present a method for learning policies with internal memory for
high-dimensional, continuous systems, such as robotic manipulators. Our
approach consists of augmenting the state and action space of the system with
continuous-valued memory states that the policy can read from and write to.
Learning general-purpose policies with this type of memory representation
directly is difficult, because the policy must automatically figure out the
most salient information to memorize at each time step. We show that, by
decomposing this policy search problem into a trajectory optimization phase and
a supervised learning phase through a method called guided policy search, we
can acquire policies with effective memorization and recall strategies.
Intuitively, the trajectory optimization phase chooses the values of the memory
states that will make it easier for the policy to produce the right action in
future states, while the supervised learning phase encourages the policy to use
memorization actions to produce those memory states. We evaluate our method on
tasks involving continuous control in manipulation and navigation settings, and
show that our method can learn complex policies that successfully complete a
range of tasks that require memory.
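
The state and action augmentation described above can be illustrated with a minimal sketch. It is not the paper's implementation: the helper `augment_dynamics`, the toy `point_step` dynamics, and the additive memory update (next memory = memory + write action) are all assumptions made for illustration. The augmented state concatenates the physical state with the memory state, and the augmented action concatenates the physical action with a memory write.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def augment_dynamics(
    step: Callable[[Sequence[float], Sequence[float]], Vector],
    state_dim: int,
    action_dim: int,
) -> Callable[[Sequence[float], Sequence[float]], Vector]:
    """Wrap physical dynamics so memory becomes part of the state and action.

    Hypothetical helper: the names and the additive write rule are assumptions.
    """

    def augmented_step(aug_state: Sequence[float], aug_action: Sequence[float]) -> Vector:
        # Split the augmented vectors into physical and memory parts.
        x, m = list(aug_state[:state_dim]), list(aug_state[state_dim:])
        u, w = list(aug_action[:action_dim]), list(aug_action[action_dim:])
        if len(m) != len(w):
            raise ValueError("memory state and write action must have equal size")
        next_x = step(x, u)  # physical dynamics ignore the memory
        next_m = [mi + wi for mi, wi in zip(m, w)]  # assumed additive write
        return list(next_x) + next_m

    return augmented_step


def point_step(x: Sequence[float], u: Sequence[float]) -> Vector:
    """Toy 1-D system (an assumption): position advances by the commanded step."""
    return [x[0] + u[0]]


aug_step = augment_dynamics(point_step, state_dim=1, action_dim=1)
state = [0.0, 0.0]                   # [position, memory]
state = aug_step(state, [1.0, 0.5])  # move by 1.0, write 0.5 into memory
state = aug_step(state, [1.0, 0.0])  # move again, leave memory unchanged
```

Because the memory is exposed as ordinary state and action dimensions, a trajectory optimizer can in principle choose memory values like any other control variable, and a policy can be trained to reproduce the corresponding write actions.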

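Summary: The paper learns policies with internal memory for high-dimensional, continuous systems (such as robotic manipulators) by adding memory states to the system's state and action spaces. Guided policy search decomposes the policy search problem into trajectory optimization and supervised learning, and combining the two phases yields complex policies with effective memorization and recall strategies.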