Learning from Observations (LfO) is a practical reinforcement learning
scenario from which many applications can benefit through the reuse of
incomplete resources. Compared to conventional imitation learning (IL), LfO is
more challenging because of the lack of expert action guidance. Distribution
matching lies at the heart of both conventional IL and LfO. Traditional
distribution matching approaches are sample-costly because they depend on
on-policy transitions for policy learning. To improve sample efficiency, some
off-policy solutions have been proposed; these, however, either lack
comprehensive theoretical justification or depend on the guidance of expert
actions. In this work, we propose a sample-efficient LfO
approach that enables off-policy optimization in a principled manner. To
further accelerate the learning procedure, we regulate the policy update with
an inverse action model, which assists distribution matching from the
perspective of mode-covering. Extensive empirical results on challenging
locomotion tasks indicate that our approach is comparable to the state of the
art in terms of both sample efficiency and asymptotic performance.

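This paper proposes a Learning from Observations method that combines
distribution matching, off-policy learning, and an inverse action model, and
that is comparable to state-of-the-art methods in both performance and sample
efficiency.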
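
The inverse action model mentioned in the abstract can be illustrated with a
small sketch. The code below is not the paper's implementation: it assumes toy
linear dynamics s' = s + a, uses a least-squares fit in place of a neural
network, and the helper names fit_inverse_action_model and predict_action are
hypothetical. It only shows how a model trained on the agent's own off-policy
transitions could infer actions for state-only expert transitions, which could
then serve as targets when regulating the policy update.

```python
import numpy as np


def _features(states, next_states):
    # Concatenate s, s' and a bias column into one design matrix.
    bias = np.ones((len(states), 1))
    return np.hstack([states, next_states, bias])


def fit_inverse_action_model(states, next_states, actions):
    """Least-squares fit of a linear map (s, s') -> a.

    Hypothetical helper for illustration only; a practical inverse action
    model would usually be a neural network trained by maximum likelihood.
    """
    weights, *_ = np.linalg.lstsq(
        _features(states, next_states), actions, rcond=None
    )
    return weights


def predict_action(weights, states, next_states):
    """Infer the action that explains each (s, s') transition."""
    return _features(states, next_states) @ weights


# Toy replay buffer collected under the assumed dynamics s' = s + a.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 2))
actions = rng.uniform(-1.0, 1.0, size=(256, 2))
next_states = states + actions

weights = fit_inverse_action_model(states, next_states, actions)

# State-only "expert" transitions: no actions are observed, so infer them.
expert_states = rng.normal(size=(8, 2))
expert_next_states = expert_states + 0.5
inferred_actions = predict_action(weights, expert_states, expert_next_states)
```

Because the toy dynamics are linear and noise-free, the fitted model recovers
a = s' - s, so every inferred expert action is close to 0.5 in each dimension.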

Off-Policy Imitation Learning from Observations

This paper studies Learning from Observations (LfO) for imitation learning
with access to state-only demonstrations. In contrast to Learning from
Demonstration (LfD), which involves both action and state supervision, LfO is
more practical in leveraging previously inapplicable resources (e.g., videos),
yet more challenging due to the incomplete expert guidance. In this paper, we
investigate LfO and its difference from LfD from both theoretical and practical
perspectives. We first prove that the gap between LfD and LfO actually lies in
the disagreement of inverse dynamics models between the imitator and the
expert, provided that the modeling approach of GAIL is followed. More
importantly, an upper bound on this gap is given by a negative causal entropy,
which can be minimized in a model-free way. We term our method
Inverse-Dynamics-Disagreement-Minimization (IDDM); it enhances the
conventional LfO method by further bridging the gap to LfD. Extensive
empirical results on challenging benchmarks indicate that our method attains
consistent improvements over other LfO counterparts.

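This paper studies Learning from Observations (LfO), i.e., imitation learning
from state-only demonstrations. From both theoretical and practical
perspectives, we first prove that, if the modeling approach of GAIL is
followed, the gap between LfD and LfO lies in the disagreement between the
inverse dynamics models of the imitator and the expert. We propose
Inverse-Dynamics-Disagreement-Minimization (IDDM), which enhances the
conventional LfO method by further narrowing the gap to LfD. Empirical results
on challenging benchmarks show that our method achieves consistent
improvements over other LfO methods.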
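
The claim that the LfD/LfO gap equals an inverse dynamics disagreement can be
made concrete with a short derivation sketch. The notation is an assumption,
since the abstract defines none: \rho_\pi and \rho_E denote the occupancy
measures of the imitator and the expert, both are assumed to share the same
transition dynamics, and the KL divergence is used for illustration rather
than the specific divergence of the GAIL formulation. Under these assumptions,
applying the chain rule for KL divergence to the joint over (s, a, s') in two
ways gives:

```latex
\begin{align}
D_{\mathrm{KL}}\big(\rho_\pi(s,a)\,\|\,\rho_E(s,a)\big)
  &= D_{\mathrm{KL}}\big(\rho_\pi(s,s')\,\|\,\rho_E(s,s')\big) \\
  &\quad + \mathbb{E}_{(s,s')\sim\rho_\pi}\Big[
       D_{\mathrm{KL}}\big(\rho_\pi(a\mid s,s')\,\|\,\rho_E(a\mid s,s')\big)
     \Big]
\end{align}
```

The left-hand side is the state-action matching objective of LfD, the first
term on the right is the state-transition matching objective of LfO, and the
remaining expectation is the disagreement between the two inverse dynamics
models.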