We study the problem of predicting and controlling the future state
distribution of an autonomous agent. This problem, which can be viewed as a
reframing of goal-conditioned reinforcement learning (RL), is centered around
learning a conditional probability density function over future states. Instead
of estimating this density function directly, we estimate it indirectly by
training a classifier to predict whether an observation
comes from the future. Via Bayes' rule, predictions from our classifier can be
transformed into predictions over future states. Importantly, an off-policy
variant of our algorithm allows us to predict the future state distribution of
a new policy, without collecting new experience. This variant allows us to
optimize functionals of a policy's future state distribution, such as the
density of reaching a particular goal state. While conceptually similar to
Q-learning, our work lays a principled foundation for goal-conditioned RL as
density estimation, providing justification for goal-conditioned methods used
in prior work. This foundation makes hypotheses about Q-learning, including the
optimal goal-sampling ratio, which we confirm experimentally. Moreover, our
proposed method is competitive with prior goal-conditioned RL methods.
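
The Bayes' rule step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classifier is trained on equal numbers of positives (states drawn from the future state distribution) and negatives (states drawn from some marginal distribution); under that assumption, a Bayes-optimal classifier output c satisfies p_future = p_marginal * c / (1 - c). The Gaussian densities and all function names below are hypothetical, chosen only to make a toy check possible; this is not the authors' implementation.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_optimal_classifier(p_future, p_marginal):
    """Probability that a sample is a 'future' state.

    Assumes equal class priors (as many positives as negatives).
    """
    return p_future / (p_future + p_marginal)


def density_from_classifier(c, p_marginal, eps=1e-12):
    """Invert the classifier via Bayes' rule: p_future = p_marginal * c / (1 - c)."""
    c = min(max(c, eps), 1.0 - eps)  # clip so the ratio stays finite
    return p_marginal * c / (1.0 - c)


# Toy check with hypothetical densities standing in for the two distributions.
x = 0.7
p_future = gaussian_pdf(x, mean=1.0, std=0.5)    # stand-in for the future state density
p_marginal = gaussian_pdf(x, mean=0.0, std=2.0)  # stand-in for the marginal state density

c = bayes_optimal_classifier(p_future, p_marginal)
recovered = density_from_classifier(c, p_marginal)
```

In this toy setting `recovered` matches `p_future` up to floating-point error. With a learned classifier the recovered density is only as accurate as the classifier's probabilities, and the clipping constant `eps` is an illustrative choice.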

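This work studies the problem of predicting and controlling the future state distribution of an autonomous agent, and proposes solving it by training a classifier to estimate the conditional probability density function indirectly. It also examines the theoretical foundations and hypotheses of Q-learning-based goal-conditioned reinforcement learning methods, and proposes an algorithm that can predict the future state distribution of a new policy.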

C-Learning: Learning to Achieve Goals via Recursive Classification

This work considers two distinct settings: imitation learning and
goal-conditioned reinforcement learning. In either case, effective solutions
require the agent to reliably reach a specified state (a goal), or set of
states (a demonstration). Drawing a connection between probabilistic long-term
dynamics and the desired value function, this work introduces an approach which
utilizes recent advances in density estimation to effectively learn to reach a
given state. As our first contribution, we use this approach for
goal-conditioned reinforcement learning and show that it is efficient and
does not suffer from hindsight bias in stochastic domains. As our second
contribution, we extend the approach to imitation learning and show that it
achieves state-of-the-art demonstration sample efficiency on standard benchmark
tasks.

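This work considers two distinct learning settings: imitation learning and goal-conditioned reinforcement learning. It introduces an approach based on the connection between probabilistic long-term dynamics and the desired value function, and uses recent advances in density estimation to learn effectively to reach a given state. The approach is efficient and free of hindsight bias in goal-conditioned reinforcement learning, and in imitation learning it achieves state-of-the-art demonstration sample efficiency on standard benchmark tasks.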