Unlike most reinforcement learning agents, which require an unrealistic number
of environment interactions to learn a new behaviour, humans excel at learning
quickly by merely observing and imitating others. This ability depends heavily
on the fact that humans have a model of their own embodiment that allows them
to infer the most likely actions that led to the observed behaviour. In this
paper, we propose Action Inference by Maximising Evidence (AIME) to replicate
this behaviour using world models. AIME consists of two distinct phases. In the
first phase, the agent learns a world model from its past experience to
understand its own body by maximising the evidence lower bound (ELBO). In the second phase, the
agent is given some observation-only demonstrations of an expert performing a
novel task and tries to imitate the expert's behaviour. AIME achieves this by
defining a policy as an inference model and maximising the evidence of the
demonstration under the policy and world model. Our method is "zero-shot" in
the sense that it does not require further training for the world model or
online interactions with the environment after being given the demonstration. We
empirically validate the zero-shot imitation performance of our method on the
Walker and Cheetah embodiments of the DeepMind Control Suite and find it
outperforms the state-of-the-art baselines. Code is available online.
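
The two phases can be illustrated with a deliberately small sketch. This is not the authors' implementation: it assumes a fully observed one-dimensional linear system, so the ELBO reduces to an exact Gaussian log-likelihood, and it assumes a linear policy with a single gain found by grid search. All names and constants (`A_TRUE`, `B_TRUE`, `NOISE_STD`, `K_EXPERT`, `env_step`, `log_evidence`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth dynamics, unknown to the agent: s' = A*s + B*u + noise.
A_TRUE, B_TRUE, NOISE_STD = 0.9, 0.5, 0.01


def env_step(s, u):
    """One noisy environment step; s and u are arrays of the same shape."""
    return A_TRUE * s + B_TRUE * u + NOISE_STD * rng.standard_normal(np.shape(s))


# Phase 1: fit a world model from the agent's own action-labelled experience.
# Least squares is the maximum-likelihood fit under Gaussian noise
# (a stand-in for ELBO maximisation in this fully observed toy).
s = rng.uniform(-1.0, 1.0, size=2000)
u = rng.uniform(-1.0, 1.0, size=2000)
s_next = env_step(s, u)
features = np.stack([s, u], axis=1)
a_hat, b_hat = np.linalg.lstsq(features, s_next, rcond=None)[0]

# Expert demonstrations: three trajectories of observations only, no actions.
# The assumed expert policy is u = -K_EXPERT * s.
K_EXPERT = 1.2
demo = [np.array([1.0, -0.8, 0.6])]
for _ in range(20):
    demo.append(env_step(demo[-1], -K_EXPERT * demo[-1]))
demo = np.stack(demo)  # shape (21, 3): time steps x trajectories


# Phase 2: keep the world model frozen and choose the policy gain k that
# maximises the evidence of the demonstration under model and policy.
def log_evidence(k):
    """Gaussian log-likelihood of the demo, up to an additive constant."""
    predicted = a_hat * demo[:-1] + b_hat * (-k * demo[:-1])
    return -0.5 * np.sum((demo[1:] - predicted) ** 2) / NOISE_STD**2


candidate_gains = np.linspace(0.0, 2.0, 2001)
k_hat = candidate_gains[np.argmax([log_evidence(k) for k in candidate_gains])]
print(f"learned model: a={a_hat:.3f}, b={b_hat:.3f}; inferred gain k={k_hat:.3f}")
```

No environment step is taken after the demonstrations are collected, and the model parameters `a_hat` and `b_hat` are not updated in the second phase, which mirrors the "zero-shot" setting described above in this toy form.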

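Using Action Inference by Maximising Evidence (AIME) with a world model, an AI agent imitates behaviour zero-shot by observing others, without further training or online interaction with the environment.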

Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models

We study the question of how to imitate tasks across domains with
discrepancies such as embodiment, viewpoint, and dynamics mismatch. Many prior
works require paired, aligned demonstrations and an additional RL step that
requires environment interactions. However, paired, aligned demonstrations are
seldom obtainable and RL procedures are expensive. We formalize the Domain
Adaptive Imitation Learning (DAIL) problem, which is a unified framework for
imitation learning in the presence of viewpoint, embodiment, and dynamics
mismatch. Informally, DAIL is the process of learning how to perform a task
optimally, given demonstrations of the task in a distinct domain. We propose a
two-step approach to DAIL: alignment followed by adaptation. In the alignment
step, we execute a novel unsupervised MDP alignment algorithm, Generative
Adversarial MDP Alignment (GAMA), to learn state and action correspondences
from unpaired, unaligned demonstrations. In the adaptation step, we
leverage the correspondences to zero-shot imitate tasks across domains. To
describe when DAIL is feasible via alignment and adaptation, we introduce a
theory of MDP alignability. We experimentally evaluate GAMA against baselines
in embodiment, viewpoint, and dynamics mismatch scenarios where aligned
demonstrations do not exist, and show the effectiveness of our approach.
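
The adaptation step can be pictured with a small sketch. It rests on assumptions not stated in the abstract: the state and action correspondences are written by hand here rather than learned by GAMA, and the transferred policy is taken to be the composition of the state map, the expert policy, and the action map. The two toy domains and every function name (`f`, `g`, `expert_policy`, `agent_policy`, `rollout`) are hypothetical.

```python
# Toy illustration only: f and g are hand-written stand-ins for the
# correspondences that GAMA would learn from unpaired, unaligned demonstrations.


def expert_policy(y):
    """Expert-domain policy (assumed): move halfway towards the origin."""
    return -0.5 * y


def expert_step(y, a_y):
    """Expert-domain dynamics (assumed): y' = y + a_y."""
    return y + a_y


def agent_step(x, a_x):
    """Agent-domain dynamics (assumed): doubled state units, sign-flipped actuator."""
    return x - a_x


def f(x):
    """State correspondence: agent state -> expert state."""
    return x / 2.0


def g(a_y):
    """Action correspondence: expert action -> agent action."""
    return -2.0 * a_y


def agent_policy(x):
    """Zero-shot policy in the agent domain, composed as g(expert_policy(f(x)))."""
    return g(expert_policy(f(x)))


def rollout(step, policy, start, horizon):
    """Roll a deterministic policy forward and return the visited states."""
    states = [start]
    for _ in range(horizon):
        states.append(step(states[-1], policy(states[-1])))
    return states


agent_traj = rollout(agent_step, agent_policy, 4.0, 5)
expert_traj = rollout(expert_step, expert_policy, f(4.0), 5)
print(agent_traj)
print(expert_traj)
```

In this toy, mapping the agent trajectory through `f` reproduces the expert trajectory step by step, which is the sense in which the two domains are aligned; no reinforcement learning or extra interaction is used to obtain `agent_policy`.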

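This work formalizes the Domain Adaptive Imitation Learning (DAIL) problem and proposes a two-step approach, alignment followed by adaptation: the alignment step runs an unsupervised MDP alignment algorithm (GAMA), and the learned correspondences are then used to zero-shot imitate tasks across domains without environment interactions.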