Reinforcement learning on high-dimensional and complex problems relies on
abstraction for improved efficiency and generalization. In this paper, we study
abstraction in the continuous-control setting, and extend the definition of MDP
homomorphisms to the setting of continuous state and action spaces. We derive a
policy gradient theorem on the abstract MDP for both stochastic and
deterministic policies. Our policy gradient results allow for leveraging
approximate symmetries of the environment for policy optimization. Based on
these theorems, we propose a family of actor-critic algorithms that are able to
learn the policy and the MDP homomorphism map simultaneously, using the lax
bisimulation metric. Finally, we introduce a series of environments with
continuous symmetries to further demonstrate the ability of our algorithm for
action abstraction in the presence of such symmetries. We demonstrate the
effectiveness of our method on our environments, as well as on challenging
visual control tasks from the DeepMind Control Suite. Our method's ability to
utilize MDP homomorphisms for representation learning leads to improved
performance, and the visualizations of the latent space clearly demonstrate the
structure of the learned abstraction.
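
As background for the abstract above, the following is the standard definition of an MDP homomorphism for finite MDPs, written as a sketch in our own notation. The symbols $f$, $g_s$, and the barred quantities are assumptions introduced here for illustration. The paper's extension to continuous state and action spaces is not reproduced.

```latex
% Standard finite-MDP definition, given as background only.
% M = (S, A, P, R) is the observed MDP, \bar{M} = (\bar{S}, \bar{A}, \bar{P}, \bar{R}) the abstract MDP.
% h = (f, \{g_s\}_{s \in S}) with surjections f : S \to \bar{S} and g_s : A \to \bar{A}.
\begin{align}
  \bar{R}\bigl(f(s),\, g_s(a)\bigr) &= R(s, a), \\
  \bar{P}\bigl(f(s') \mid f(s),\, g_s(a)\bigr) &= \sum_{s'' \in f^{-1}(f(s'))} P(s'' \mid s, a),
\end{align}
% for all s, s' \in S and a \in A.
```

Under these two conditions the optimal action values agree, $Q^*(s, a) = \bar{Q}^*(f(s), g_s(a))$, which is why a policy learned on the abstract MDP can be lifted back to the original one.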

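This work uses abstraction to improve the efficiency and generalization of reinforcement learning on high-dimensional, complex problems. It studies abstraction in the continuous-control setting, proposes a family of policy gradient (actor-critic) algorithms based on the lax bisimulation metric, and introduces environments with continuous symmetries to evaluate them. The results show that using MDP homomorphisms for representation learning improves performance.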

Policy Gradient Methods in the Presence of Symmetries and State Abstractions

Animals are able to rapidly infer from limited experience when sets of
state-action pairs have equivalent reward and transition dynamics. On the other
hand, modern reinforcement learning systems must painstakingly learn through
trial and error that sets of state-action pairs are value equivalent, requiring
an often prohibitively large number of samples from their environment. MDP
homomorphisms have been proposed that reduce the observed MDP of an environment
to an abstract MDP, which can enable more sample-efficient policy learning.
Consequently, impressive improvements in sample efficiency have been achieved
when a suitable MDP homomorphism can be constructed a priori, usually by
exploiting a practitioner's knowledge of environment symmetries. We propose a
novel approach to constructing a homomorphism in discrete action spaces, which
uses a partial model of environment dynamics to infer which state-action pairs
lead to the same state, reducing the size of the state-action space by a
factor equal to the cardinality of the action space. We call this method
equivalent effect abstraction. In a gridworld setting, we demonstrate
empirically that equivalent effect abstraction can improve sample efficiency in
a model-free setting and planning efficiency for model-based approaches.
Furthermore, we show on cartpole that our approach outperforms an existing
method for learning homomorphisms, while using 33x less training data.
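
The reduction described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical deterministic 5x5 gridworld with a known forward model, a reward that depends only on the next state, and an arbitrarily chosen canonical action. The names `step`, `abstract_pair`, and `CANONICAL` are invented for this example. Each state-action pair is mapped to the hypothetical state from which the canonical action would reach the same next state, so values need to be stored for one action only.

```python
from itertools import product

N = 5  # grid side length (hypothetical toy environment)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
CANONICAL = "right"  # arbitrary choice of canonical action (assumption)


def step(state, action):
    """Deterministic forward model: move one cell, clipping at the walls."""
    dr, dc = ACTIONS[action]
    r, c = state
    return (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))


def abstract_pair(state, action):
    """Map (state, action) to the hypothetical state from which the
    canonical action would produce the same next state."""
    nr, nc = step(state, action)
    dr, dc = ACTIONS[CANONICAL]
    # The result may lie outside the grid; it is only used as a lookup key.
    return (nr - dr, nc - dc)


states = list(product(range(N), range(N)))
pairs = [(s, a) for s in states for a in ACTIONS]
abstract_keys = {abstract_pair(s, a) for s, a in pairs}

# 100 state-action pairs collapse to 25 keys: a factor of |A| = 4.
print(len(pairs), len(abstract_keys))
```

A tabular learner could then index its value table by `abstract_pair(s, a)` instead of `(s, a)`. In this toy setting the forward model is given, whereas the abstract describes inferring the equivalence from a partial model of the dynamics.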

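This work proposes a new method, equivalent effect abstraction, which uses a partial model of environment dynamics to infer which state-action pairs lead to the same state, reducing the size of the state-action space by a factor equal to the cardinality of the action space and thereby improving sample and planning efficiency. In a gridworld, experiments show that it improves sample efficiency in the model-free setting and planning efficiency for model-based approaches. On cartpole, it outperforms an existing method for learning homomorphisms while using 33x less training data.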

A Simple Approach for State-Action Abstraction using a Learned MDP Homomorphism

Abstraction of Markov Decision Processes is a useful tool for solving complex
problems, as it can ignore unimportant aspects of an environment, simplifying
the process of learning an optimal policy. In this paper, we propose a new
algorithm for finding abstract MDPs in environments with continuous state
spaces. It is based on MDP homomorphisms, structure-preserving mappings
between MDPs. We demonstrate our algorithm's ability to learn abstractions from
collected experience and show how to reuse the abstractions to guide
exploration in new tasks the agent encounters. Our novel task transfer method
outperforms baselines based on a deep Q-network in the majority of our
experiments. The source code is available online.

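This paper proposes a new algorithm, based on MDP homomorphisms, for finding abstract MDPs in environments with continuous state spaces. It demonstrates that abstractions can be learned from collected experience and shows how to reuse them to guide exploration in new tasks. The proposed task transfer method outperforms deep Q-network baselines in the majority of experiments.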