We consider the challenge of policy simplification and verification in the
context of policies learned through reinforcement learning (RL) in continuous
environments. In well-behaved settings, RL algorithms have convergence
guarantees in the limit. While these guarantees are valuable, they are
insufficient for safety-critical applications. Furthermore, they are lost when
applying advanced techniques such as deep-RL. To recover guarantees when
applying advanced RL algorithms to more complex environments with (i)
reachability, (ii) safety-constrained reachability, or (iii) discounted-reward
objectives, we build upon the DeepMDP framework introduced by Gelada et al. to
derive new bisimulation bounds between the unknown environment and a learned
discrete latent model of it. Our bisimulation bounds enable the application of
formal methods for Markov decision processes. Finally, we show how one can use
a policy obtained via state-of-the-art RL to efficiently train a variational
autoencoder that yields a discrete latent model with provably approximately
correct bisimulation guarantees. Additionally, we obtain a distilled version of
the policy for the latent model.

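To address policy simplification and verification in reinforcement learning,
the authors build on the DeepMDP framework to derive new bisimulation bounds
between the unknown environment and a learned discrete latent model of it.
These bounds make it possible to apply formal methods for MDPs. The authors
also show how a policy obtained with state-of-the-art RL can be used to
efficiently train a VAE that yields a discrete latent model with provably
approximately correct bisimulation guarantees, together with a distilled
version of the policy for that latent model.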

Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report)
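
To make concrete what a discrete latent model buys, the sketch below runs
plain value iteration for the maximal probability of reaching a goal while
avoiding unsafe states, which corresponds to objectives (i) and (ii) in the
abstract. It is an illustration only, not the authors' tool chain: the function
name `max_reachability`, the dictionary encoding of the model, and the
four-state toy MDP are all assumptions made for this example. The stopping
test on successive iterates is a common heuristic and does not by itself
certify the precision of the result.

```python
from typing import AbstractSet, Dict, List, Tuple

# Hypothetical encoding: transitions[state][action] = [(probability, next_state), ...]
Transitions = Dict[int, Dict[int, List[Tuple[float, int]]]]


def max_reachability(
    transitions: Transitions,
    goal: AbstractSet[int],
    unsafe: AbstractSet[int] = frozenset(),
    iterations: int = 1000,
    tol: float = 1e-10,
) -> Dict[int, float]:
    """Value iteration for the maximal probability of reaching `goal`
    without entering `unsafe`, on a finite MDP."""
    # Start from below: goal states have value 1, everything else 0.
    values = {s: (1.0 if s in goal else 0.0) for s in transitions}
    for _ in range(iterations):
        delta = 0.0
        for s, actions in transitions.items():
            if s in goal or s in unsafe:
                continue  # goal stays at 1, unsafe stays at 0
            # Bellman update: best action by expected value of the successor.
            best = max(
                sum(p * values[t] for p, t in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    return values


# Toy latent MDP (assumed for illustration): state 2 is the goal, state 3 is unsafe.
toy: Transitions = {
    0: {0: [(0.5, 2), (0.5, 3)], 1: [(1.0, 1)]},
    1: {0: [(0.9, 2), (0.1, 3)]},
    2: {0: [(1.0, 2)]},
    3: {0: [(1.0, 3)]},
}
values = max_reachability(toy, goal={2}, unsafe={3})
```

In the toy model, taking action 1 in state 0 moves to state 1, from which the
goal is reached with probability 0.9, so `values[0]` converges to 0.9 rather
than the 0.5 offered by action 0. Dedicated probabilistic model checkers
perform this kind of analysis with sound error bounds.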

Many reinforcement learning (RL) tasks provide the agent with
high-dimensional observations that can be simplified into low-dimensional
continuous states. To formalize this process, we introduce the concept of a
DeepMDP, a parameterized latent space model that is trained via the
minimization of two tractable losses: prediction of rewards and prediction of
the distribution over next latent states. We show that the optimization of
these objectives guarantees (1) the quality of the latent space as a
representation of the state space and (2) the quality of the DeepMDP as a model
of the environment. We connect these results to prior work in the bisimulation
literature, and explore the use of a variety of metrics. Our theoretical
findings are substantiated by the experimental result that a trained DeepMDP
recovers the latent structure underlying high-dimensional observations on a
synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary
task in the Atari 2600 domain leads to large performance improvements over
model-free RL.

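Introduces DeepMDP, a parameterized latent space model trained by minimizing two losses: prediction of rewards and prediction of the distribution over next latent states. The approach aims to learn low-dimensional continuous representations of high-dimensional observations, and learning a DeepMDP as an auxiliary task on Atari 2600 is shown to yield large performance improvements over model-free RL.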
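
The two losses named in the abstract can be sketched on a batch of sampled
transitions as follows. This is a minimal illustration under stated
assumptions, not the paper's implementation: the latent transition function is
taken to be deterministic, so the loss on the next-latent-state distribution is
written as an expected Euclidean distance, and the names `embed`,
`latent_reward`, `latent_transition`, and `deepmdp_losses` are hypothetical
stand-ins for learned networks and training code.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# One sampled transition: (state, action, reward, next_state)
Transition = Tuple[Vector, int, float, Vector]


def l2_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def deepmdp_losses(
    batch: Sequence[Transition],
    embed: Callable[[Vector], Vector],
    latent_reward: Callable[[Vector, int], float],
    latent_transition: Callable[[Vector, int], Vector],
) -> Tuple[float, float]:
    """Return (reward loss, transition loss) averaged over the batch."""
    reward_loss = 0.0
    transition_loss = 0.0
    for state, action, reward, next_state in batch:
        z = embed(state)
        # Reward loss: gap between the observed and the predicted reward.
        reward_loss += abs(reward - latent_reward(z, action))
        # Transition loss (deterministic-latent assumption): distance between
        # the embedded next state and the predicted next latent state.
        transition_loss += l2_distance(embed(next_state), latent_transition(z, action))
    n = len(batch)
    return reward_loss / n, transition_loss / n


# Toy stand-ins (assumed): the latent state is the first observation coordinate.
embed = lambda s: [s[0]]
latent_reward = lambda z, a: z[0]
latent_transition = lambda z, a: [z[0] + a]

batch = [
    ([1.0, 9.0], 1, 1.0, [2.0, 3.0]),  # both predictions are exact
    ([0.0, 5.0], 1, 0.5, [2.0, 0.0]),  # reward off by 0.5, next latent off by 1.0
]
reward_loss, transition_loss = deepmdp_losses(batch, embed, latent_reward, latent_transition)
```

In training, both quantities would be minimized jointly with respect to the
parameters of the embedding and the latent model; here they are only evaluated
once on fixed toy functions to show what each term measures.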