We consider the challenge of policy simplification and verification in the
context of policies learned through reinforcement learning (RL) in continuous
environments. In well-behaved settings, RL algorithms have convergence
guarantees in the limit. While these guarantees are valuable, they are
insufficient for safety-critical applications. Furthermore, they are lost when
applying advanced techniques such as deep-RL. To recover guarantees when
applying advanced RL algorithms to more complex environments with (i)
reachability, (ii) safety-constrained reachability, or (iii) discounted-reward
objectives, we build upon the DeepMDP framework introduced by Gelada et al. to
derive new bisimulation bounds between the unknown environment and a learned
discrete latent model of it. Our bisimulation bounds enable the application of
formal methods for Markov decision processes. Finally, we show how one can use
a policy obtained via state-of-the-art RL to efficiently train a variational
autoencoder that yields a discrete latent model with provably approximately
correct bisimulation guarantees, together with a distilled version of the
policy for the latent model.

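Summary: To address the challenge of policy simplification and verification in
reinforcement learning, the authors build on the DeepMDP framework to derive
new bisimulation bounds between the unknown environment and a learned discrete
latent model of it; these bounds make formal methods for MDPs applicable. The
authors also show how a policy obtained via state-of-the-art RL can be used to
efficiently train a VAE that yields a discrete latent model with provably
approximately correct bisimulation guarantees, along with a distilled version
of the policy for that latent model.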