Solving control tasks in complex environments automatically through learning
offers great potential. While contemporary techniques from deep reinforcement
learning (DRL) provide effective solutions, their decision-making is not
transparent. We aim to provide insights into the decisions faced by the agent
by learning an automaton model of environmental behavior under the control of
an agent. However, for most control problems, automata learning is not scalable
enough to learn a useful model. In this work, we extend the capabilities of
automata learning so that it can learn models of environments with complex and
continuous dynamics.
The scalability of our method rests on computing an abstract state-space
representation by applying dimensionality reduction and clustering to the
observed environmental state space. The stochastic
transitions are learned via passive automata learning from observed
interactions of the agent and the environment. In an iterative model-based RL
process, we sample additional trajectories to learn an accurate environment
model in the form of a discrete-state Markov decision process (MDP). We apply
our automata learning framework to popular RL benchmark environments from
OpenAI Gym, including LunarLander, CartPole, MountainCar, and Acrobot. Our
results show that the learned models are precise enough to enable the
computation of policies that solve the respective control tasks. Yet the models
are more concise and more general than neural-network-based policies, and by
using MDPs we benefit from the wealth of tools available for analyzing them. On
LunarLander, the learned model even achieved rewards similar to or higher than
those of deep RL policies trained with stable-baselines3.
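
The abstraction and model-estimation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dimensionality-reduction step is omitted, plain k-means stands in for the clustering, and simple frequency counting stands in for passive automata learning. The function names (`kmeans`, `nearest`, `estimate_mdp`, `toy_trajectories`) and the one-dimensional toy environment are hypothetical, introduced only for illustration.

```python
import random
from collections import Counter, defaultdict

def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(c) / len(groups[i]) for c in zip(*groups[i]))
            if groups[i] else centroids[i]
            for i in range(k)
        ]
    return centroids

def estimate_mdp(trajectories, centroids):
    """Count abstract (state, action) -> next-state transitions and
    normalise the counts into empirical probabilities."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for state, action, next_state in traj:
            s = nearest(state, centroids)
            t = nearest(next_state, centroids)
            counts[(s, action)][t] += 1
    return {sa: {t: n / sum(c.values()) for t, n in c.items()}
            for sa, c in counts.items()}

def toy_trajectories(n=50, length=30, seed=1):
    """Hypothetical toy environment: a noisy walk on [-1, 1] under a
    random policy; action 1 moves right, action 0 moves left."""
    rng = random.Random(seed)
    trajs = []
    for _ in range(n):
        x, traj = rng.uniform(-1, 1), []
        for _ in range(length):
            a = rng.choice([0, 1])
            step = 0.2 if a else -0.2
            nx = max(-1.0, min(1.0, x + step + rng.gauss(0, 0.05)))
            traj.append(((x,), a, (nx,)))
            x = nx
        trajs.append(traj)
    return trajs

trajs = toy_trajectories()
points = [s for traj in trajs for s, _, _ in traj]
centroids = kmeans(points, k=4)    # abstract states = cluster indices
mdp = estimate_mdp(trajs, centroids)
```

Here `mdp[(s, a)]` maps each abstract successor state to its estimated probability. A discrete model of this shape can then be handed to standard MDP tooling, for example value iteration, to compute a policy.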

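Using techniques such as deep reinforcement learning, automata learning, and Markov decision processes, this work learns a model of an environment under the control of an autonomous agent in order to solve control problems in complex environments, and validates the approach on several reinforcement learning benchmark environments.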

Learning Environment Models with Continuous Stochastic Dynamics

Our work aims at developing reinforcement learning algorithms that do not
rely on the Markov assumption. We consider the class of Non-Markov Decision
Processes where histories can be abstracted into a finite set of states while
preserving the dynamics. We call such an abstraction a Markov abstraction, since
it induces a Markov Decision Process over a set of states that encode the
non-Markov dynamics. This phenomenon underlies the recently introduced Regular Decision
Processes (as well as POMDPs where only a finite number of belief states is
reachable). In all such decision processes, an agent that uses a Markov
abstraction can rely on the Markov property to achieve optimal behaviour. We
show that Markov abstractions can be learned during reinforcement learning. Our
approach combines automata learning and classic reinforcement learning. For
these two tasks, standard algorithms can be employed. We show that our approach
has PAC guarantees when the employed algorithms have PAC guarantees, and we
also provide an experimental evaluation.
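
To make the idea of a Markov abstraction concrete, the following is a minimal sketch under strong assumptions. The abstraction is hand-written rather than learned, so the automata-learning half of the approach is left out entirely. The toy task, the parity automaton `dfa_step`, and the function `run` are all hypothetical. In this task the reward depends on the parity of all bits observed so far, which the last observation alone does not reveal, whereas the automaton state captures it exactly, so tabular Q-learning over automaton states can rely on the Markov property.

```python
import random

def dfa_step(q, bit):
    """Hand-written abstraction: parity of the 1-bits observed so far."""
    return q ^ bit

def run(episodes=300, horizon=20, alpha=0.2, gamma=0.5, eps=0.1, seed=0):
    """Tabular Q-learning over the automaton states of a toy task."""
    rng = random.Random(seed)
    Q = {(q, a): 0.0 for q in (0, 1) for a in (0, 1)}
    for _ in range(episodes):
        q = 0  # automaton state summarising the empty history
        for _ in range(horizon):
            # Epsilon-greedy action selection on the abstract state.
            if rng.random() < eps:
                a = rng.choice([0, 1])
            else:
                a = max((0, 1), key=lambda x: Q[(q, x)])
            # Reward is history-dependent: the action must match the parity.
            reward = 1.0 if a == q else 0.0
            bit = rng.choice([0, 1])      # next observation
            q_next = dfa_step(q, bit)     # abstraction update
            target = reward + gamma * max(Q[(q_next, 0)], Q[(q_next, 1)])
            Q[(q, a)] += alpha * (target - Q[(q, a)])
            q = q_next
    return Q

Q = run()
# Greedy policy over abstract states.
policy = {q: max((0, 1), key=lambda a: Q[(q, a)]) for q in (0, 1)}
```

In the setting described above, the automaton would itself be produced by an automata-learning algorithm during reinforcement learning. The sketch only shows why a standard Markov RL algorithm suffices once such an abstraction is available.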

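This paper proposes an approach that combines automata learning with classic reinforcement learning to learn Markov abstractions of non-Markov decision processes, and shows that the approach has PAC guarantees when the employed algorithms have PAC guarantees.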