In many real-world settings, an agent must learn to act in environments where
no reward signal can be specified, but a set of expert demonstrations is
available. Imitation learning (IL) is a popular framework for learning policies
from such demonstrations. However, in some cases, differences in observability
between the expert and the agent can give rise to an imitation gap such that
the expert's policy is not optimal for the agent and a naive application of IL
can fail catastrophically. In particular, if the expert observes the Markov
state and the agent does not, then the expert will not demonstrate the
information-gathering behavior needed by the agent but not the expert. In this
paper, we propose a Bayesian solution to the Imitation Gap (BIG). BIG first uses
the expert demonstrations, together with a prior specifying the cost of
exploratory behavior that is not demonstrated, to infer a posterior over
rewards with Bayesian inverse reinforcement learning (IRL). It then uses the
reward posterior to learn a Bayes-optimal policy. Our experiments show that
BIG, unlike IL, allows the agent to explore at test time when presented with an
imitation gap, whilst still learning to behave optimally using expert
demonstrations when no such gap exists.

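For environments that lack a reward signal, we propose a Bayesian solution to the imitation gap (BIG). It uses expert demonstrations, together with a prior specifying the cost of undemonstrated exploratory behavior, to infer a reward posterior via Bayesian inverse reinforcement learning (IRL), and from that posterior learns a Bayes-optimal policy. Our experiments show that BIG can adapt to an imitation gap at test time, while still learning optimal behavior from expert demonstrations when no imitation gap exists.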

A Bayesian Solution To The Imitation Gap
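
The Bayesian IRL step named in the abstract above can be illustrated with a minimal sketch. It is not the BIG algorithm. It assumes a hypothetical 5-state chain environment, a hand-picked set of two candidate reward functions, a uniform prior over them, and a Boltzmann-rational expert with inverse temperature `BETA`. None of these choices come from the paper, and the sketch models neither partial observability nor the exploration-cost prior. All names (`step`, `q_values`, `reward_posterior`) are illustrative.

```python
import numpy as np

N_STATES = 5        # chain of states 0..4 (hypothetical toy environment)
ACTIONS = (-1, 1)   # action index 0 moves left, index 1 moves right
GAMMA = 0.9         # discount factor (assumed)
BETA = 5.0          # expert rationality / inverse temperature (assumed)


def step(state, action):
    """Deterministic chain dynamics with walls at both ends."""
    return min(max(state + action, 0), N_STATES - 1)


def q_values(reward, n_iters=200):
    """Value iteration; reward[s] is collected on entering state s."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(n_iters):
        v = q.max(axis=1)
        for s in range(N_STATES):
            for i, a in enumerate(ACTIONS):
                s_next = step(s, a)
                q[s, i] = reward[s_next] + GAMMA * v[s_next]
    return q


def log_likelihood(demos, reward):
    """Log-probability of (state, action_index) pairs under a Boltzmann expert."""
    scaled_q = BETA * q_values(reward)
    log_pi = scaled_q - np.logaddexp.reduce(scaled_q, axis=1, keepdims=True)
    return sum(log_pi[s, i] for s, i in demos)


def reward_posterior(demos, hypotheses, prior):
    """Bayes' rule over a finite set of candidate reward vectors."""
    log_post = np.log(prior) + np.array(
        [log_likelihood(demos, r) for r in hypotheses]
    )
    log_post -= log_post.max()          # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()


# Two candidate rewards: goal at the left end or at the right end of the chain.
hypotheses = [np.eye(N_STATES)[0], np.eye(N_STATES)[N_STATES - 1]]
prior = np.array([0.5, 0.5])            # uniform prior (assumption)
demos = [(1, 1), (2, 1), (3, 1)]        # expert moves right from states 1, 2, 3
posterior = reward_posterior(demos, hypotheses, prior)
print(posterior)
```

Because every demonstrated action moves right, the posterior mass should concentrate on the hypothesis with the goal at state 4. In a fuller treatment the finite hypothesis set would give way to a richer reward model, and the agent would then plan against the whole posterior rather than a single point estimate.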

Despite the availability of ever more data enabled through modern sensor and
computer technology, learning dynamical systems in a sample-efficient way
remains an open problem. We propose active learning strategies that
leverage information-theoretic properties arising naturally during Gaussian
process regression, while respecting constraints on the sampling process
imposed by the system dynamics. Sample points are selected in regions with high
uncertainty, leading to exploratory behavior and data-efficient training of the
model. All results are verified in an extensive numerical benchmark.

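This paper proposes active learning strategies that exploit information-theoretic properties arising naturally in Gaussian process regression, while respecting the constraints that the system dynamics impose on the sampling process. Sample points are selected in regions of high uncertainty, which yields exploratory behavior and data-efficient training. The method is verified in an extensive numerical benchmark.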
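
The idea of selecting sample points where the Gaussian process is most uncertain can be sketched in a few lines. This is a simplified illustration, not the strategies proposed in the paper. It assumes a zero-mean GP with a unit-variance squared-exponential kernel, fixed hyperparameters, and a fixed 1-D candidate grid, and it ignores the dynamics-imposed sampling constraints that the abstract mentions. The function names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def posterior_variance(x_train, x_cand, noise=1e-4):
    """GP predictive variance at x_cand, given inputs x_train.

    With fixed hyperparameters this depends only on the input locations,
    not on the observed targets.
    """
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ct = rbf_kernel(x_cand, x_train)
    solved = np.linalg.solve(k_tt, k_ct.T)      # shape (n_train, n_cand)
    return 1.0 - np.sum(k_ct * solved.T, axis=1)


def select_next(x_train, x_cand):
    """Pick the candidate with the largest predictive variance."""
    return x_cand[np.argmax(posterior_variance(x_train, x_cand))]


x_train = np.array([0.0])                  # one initial sample location
candidates = np.linspace(0.0, 2.0, 21)     # candidate grid (assumption)
for _ in range(4):
    x_new = select_next(x_train, candidates)
    x_train = np.append(x_train, x_new)    # in practice: measure the system here
print(x_train)
```

Starting from a single sample at 0, the rule first picks the far end of the grid and then fills the largest remaining gap. A constrained variant would restrict `candidates` at each step to points the system can actually reach from its current state.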