We consider the problem of learning from observation (LfO), in which the
agent aims to mimic an expert's behavior from state-only expert
demonstrations. We additionally assume that the agent cannot interact with the
environment but has access to action-labeled transition data collected by
agents of unknown quality. This offline setting for LfO is appealing
in many real-world scenarios where the ground-truth expert actions are
inaccessible and arbitrary environment interactions are costly or risky. In
this paper, we present LobsDICE, an offline LfO algorithm that learns to
imitate the expert policy via optimization in the space of stationary
distributions. Our algorithm solves a single convex minimization problem that
minimizes the divergence between the state-transition distributions induced by
the expert policy and the agent policy. On an extensive set of offline LfO
tasks, we show that LobsDICE outperforms strong baseline methods.
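
As a rough sketch of the matching objective described above, the problem can be written as follows. The notation is assumed for illustration and does not appear in the text: $d^{\pi}$ and $d^{E}$ denote the stationary state-transition distributions of the agent and the expert, $\gamma \in [0, 1)$ is a discount factor, and $D$ stands for whichever divergence is used. This shows only what is being matched, not the convex reformulation that the algorithm actually solves.

```latex
% Assumed notation: \pi is the agent policy, E the expert,
% \gamma a discount factor, D an unspecified divergence.
\begin{align}
  d^{\pi}(s, s') &= (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}
      \Pr\left(s_t = s,\; s_{t+1} = s' \mid \pi\right), \\
  \pi^{*} &\in \arg\min_{\pi}\;
      D\left( d^{\pi}(s, s') \,\middle\|\, d^{E}(s, s') \right).
\end{align}
```

Because both distributions are defined over state pairs $(s, s')$ only, the objective needs no expert actions, which is consistent with the state-only demonstrations assumed in the LfO setting.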

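This paper studies how to imitate expert behavior by learning from observations when expert action data is unavailable. It proposes LobsDICE, an offline algorithm that imitates the expert policy by optimizing over stationary distributions and performs well on a range of offline LfO tasks.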