As reinforcement learning agents are asked to solve more challenging and
diverse tasks, the ability to incorporate prior knowledge into the learning
system and to exploit reusable structure in the solution space is likely to become
increasingly important. The KL-regularized expected reward objective
constitutes one possible tool to this end. It introduces an additional
component, a default or prior behavior, which can be learned alongside the
policy and as such partially transforms the reinforcement learning problem into
one of behavior modelling. In this work we consider the implications of this
framework in cases where both the policy and default behavior are augmented
with latent variables. We discuss how the resulting hierarchical structures can
be used to implement different inductive biases and how their modularity can
benefit transfer. Empirically, we find that they can lead to faster learning and
transfer on a range of continuous control tasks.

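This paper proposes a reinforcement learning agent approach based on the KL-regularized expected reward objective, which can incorporate prior knowledge and exploit reusable structure in the solution space. It also discusses how, once latent variables are added, the resulting hierarchical structures can implement different inductive biases, and how this bears on transfer learning. Experiments show that the approach can be applied to a range of continuous control tasks, yielding faster learning and transfer.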