This paper presents a state representation for reward-free Markov decision
processes. The idea is to learn, in a self-supervised manner, an embedding
space where distances between pairs of embedded states correspond to the
minimum number of actions needed to transition between them. Unlike previous
methods, our approach incorporates an asymmetric norm parametrization, enabling
accurate approximations of minimum action distances in environments with
inherent asymmetry. We show how this representation can be leveraged to learn
goal-conditioned policies, providing a notion of similarity between states and
goals and a useful heuristic distance to guide planning. To validate our
approach, we run experiments on both symmetric and asymmetric environments.
The results show that our asymmetric norm parametrization performs comparably
to symmetric norms in symmetric environments and outperforms them in
asymmetric environments.
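
To make the role of asymmetry concrete, the following is a minimal sketch of one simple asymmetric norm applied to the difference of two embeddings. It is an illustrative assumption, not the parametrization used in this paper: the function name `asymmetric_distance` and the weights `alpha` and `beta` are hypothetical, and the embeddings are hand-written vectors rather than learned ones.

```python
def asymmetric_distance(z_s, z_g, alpha=1.0, beta=0.5):
    """Distance from embedding z_s to embedding z_g under a simple asymmetric norm.

    Hypothetical illustration: positive components of (z_g - z_s) are weighted
    by alpha and negative components by beta. With alpha != beta the value
    depends on the direction of travel; with alpha == beta it reduces to a
    scaled L1 (symmetric) norm.
    """
    total = 0.0
    for a, b in zip(z_s, z_g):
        diff = b - a
        total += alpha * max(diff, 0.0) + beta * max(-diff, 0.0)
    return total


# Hand-written example embeddings (assumed values, not learned).
z_start = [0.0, 0.0]
z_goal = [2.0, -1.0]

forward = asymmetric_distance(z_start, z_goal)   # start -> goal
backward = asymmetric_distance(z_goal, z_start)  # goal -> start
print(forward, backward)
```

Because this function is positively homogeneous and subadditive in the difference vector, it satisfies the triangle inequality while allowing the distance from a state to a goal to differ from the distance back, which is the property a symmetric norm cannot express.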
