Many real-world reinforcement learning (RL) problems require learning
complex, temporally extended behavior that may receive a reward signal only
once the behavior is completed. If the reward-worthy behavior is known, it can be
specified as a non-Markovian reward function: a function that depends on
aspects of the state-action history rather than only on the current state and
action. Such reward functions yield sparse rewards, so an inordinate number of
experiences is needed to find a policy that captures the reward-worthy pattern
of behavior. Recent work has leveraged Knowledge Representation (KR) to provide
a symbolic abstraction of the state that summarizes reward-relevant properties
of the state-action history. This abstraction supports learning a Markovian
decomposition of the problem in the form of an automaton over the KR. Providing
such a decomposition has been shown to vastly improve learning rates,
especially when coupled with algorithms that exploit the automaton structure.
Nevertheless, such techniques rely on a priori knowledge of the KR. In this
work, we explore how to automatically discover useful state abstractions that
support learning automata over the state-action history. The result is an
end-to-end algorithm that can learn optimal policies with significantly fewer
environment samples than state-of-the-art RL on simple non-Markovian domains.
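To make the idea of an automaton-based decomposition concrete, here is a minimal sketch built on a hypothetical "fetch coffee, then go to the office" task. It assumes a hand-specified abstraction that labels each state with `"c"` (at the coffee machine), `"o"` (at the office) or `""` (neither). Those labels stand in for the KR, and all names here are illustrative. The sketch shows that a reward depending on the whole label history can be reproduced by a small automaton whose state summarizes the relevant history, which makes the reward Markovian over (automaton state, label). It does not show how the abstraction itself is discovered, which is the subject of this work.

```python
from typing import Dict, List, Tuple

# Hypothetical abstract labels: "c" = at coffee machine, "o" = at office, "" = neither.
Label = str


def history_reward(labels: List[Label], t: int) -> float:
    """Non-Markovian reward: inspects the whole label history up to step t.

    Returns 1.0 the first time the agent reaches the office after having
    visited the coffee machine at some strictly earlier step, else 0.0.
    """
    def completes_task(j: int) -> bool:
        return labels[j] == "o" and "c" in labels[:j]

    if not completes_task(t):
        return 0.0
    # Only the first completion is rewarded.
    return 0.0 if any(completes_task(j) for j in range(t)) else 1.0


# Illustrative automaton: (automaton state, label) -> (next state, reward).
# Pairs not listed keep the automaton state unchanged and give reward 0.
TRANSITIONS: Dict[Tuple[int, Label], Tuple[int, float]] = {
    (0, "c"): (1, 0.0),  # coffee fetched
    (1, "o"): (2, 1.0),  # coffee delivered to the office: task complete
}


def automaton_rewards(labels: List[Label]) -> List[float]:
    """Markovian reward over (automaton state, label): the automaton state
    carries all the history information the reward depends on."""
    u, rewards = 0, []
    for label in labels:
        u, r = TRANSITIONS.get((u, label), (u, 0.0))
        rewards.append(r)
    return rewards


if __name__ == "__main__":
    trace = ["", "o", "c", "", "o", "o"]
    print(automaton_rewards(trace))  # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    print([history_reward(trace, t) for t in range(len(trace))])  # same list
```

The history-based function must scan the whole trace at every step. The automaton needs only its current state and the current label. That is why an RL agent can treat the pair (environment state, automaton state) as a Markovian state.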

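Leveraging knowledge representation and automaton structure, this paper proposes an end-to-end algorithm that automatically discovers useful state abstractions for learning optimal policies in non-Markovian domains; compared with state-of-the-art reinforcement learning algorithms, it obtains better results with fewer environment samples.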