In this paper, we focus on single-demonstration imitation learning (IL), a
practical approach for real-world applications where obtaining numerous expert
demonstrations is costly or infeasible. In contrast to typical IL settings with
multiple demonstrations, single-demonstration IL involves an agent having
access to only one expert trajectory. We highlight the issue of sparse reward
signals in this setting and mitigate it with our Transition
Discriminator-based IL (TDIL) method. TDIL is an inverse reinforcement learning
(IRL) method designed to address reward sparsity by introducing a denser
surrogate reward function that accounts for environmental dynamics. This
surrogate reward function
encourages the agent to navigate towards states that are proximal to expert
states. In practice, TDIL trains a transition discriminator to differentiate
between valid and non-valid transitions in a given environment and uses it to
compute the surrogate rewards. Our experiments demonstrate that TDIL outperforms existing
IL approaches and achieves expert-level performance in the single-demonstration
IL setting across five widely adopted MuJoCo benchmarks as well as the "Adroit
Door" environment.
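
The surrogate-reward idea described above can be illustrated with a minimal sketch. It is not the TDIL implementation: the 1-D chain of integer states, the hand-written `toy_transition_score` (standing in for a trained transition discriminator), and the max-over-expert-states rule in `surrogate_reward` are all assumptions made for illustration. The sketch only shows how scoring transitions into expert states rewards states near the demonstration, not just states on it.

```python
from typing import Callable, Dict, Sequence

State = int  # hypothetical: states are positions on a 1-D chain


def toy_transition_score(s: State, s_next: State) -> float:
    """Stand-in for a trained transition discriminator (assumption).

    Returns 1.0 if s -> s_next is a valid one-step move on the chain
    (stay, or move by one position), and 0.0 otherwise.
    """
    return 1.0 if abs(s_next - s) <= 1 else 0.0


def surrogate_reward(
    state: State,
    expert_states: Sequence[State],
    score: Callable[[State, State], float],
) -> float:
    """Highest score of any transition from `state` into an expert state."""
    return max(score(state, e) for e in expert_states)


# A single (hypothetical) expert trajectory visiting states 3, 4 and 5.
expert_trajectory = [3, 4, 5]

# Surrogate reward for every state on a chain of length 8.
rewards: Dict[State, float] = {
    s: surrogate_reward(s, expert_trajectory, toy_transition_score)
    for s in range(8)
}
```

In this toy setting, states 2 and 6 are never visited by the expert yet still receive a positive reward because an expert state is reachable from them in one step. This makes the signal denser than rewarding exact matches with the demonstration.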

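The sparse-reward problem in single-demonstration imitation learning is mitigated by the Transition Discriminator-based IL (TDIL) method, which outperforms existing IL methods and achieves expert-level performance on five widely adopted MuJoCo benchmarks and the "Adroit Door" environment.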

Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning

Reinforcement Learning has emerged as a strong alternative for solving
optimization tasks efficiently. These algorithms depend heavily on the feedback
signals provided by the environment, which indicate how good (or bad) the
decisions made by the agent are. Unfortunately, in a broad range of problems
the design of a good reward function is not trivial, so sparse reward signals
are adopted instead. The lack
of a dense reward function poses new challenges, mostly related to exploration.
Imitation Learning has addressed those problems by leveraging demonstrations
from experts. In the absence of an expert (and its subsequent demonstrations),
an option is to prioritize well-suited exploration experiences collected by the
agent in order to bootstrap its learning process with good exploration
behaviors. However, this solution depends heavily on the ability of the agent to
discover such trajectories in the early stages of its learning process. To
tackle this issue, we propose to combine imitation learning with intrinsic
motivation, two of the most widely adopted techniques to address problems with
sparse reward. In this work intrinsic motivation is used to encourage the agent
to explore the environment based on its curiosity, whereas imitation learning
allows repeating the most promising experiences to accelerate the learning
process. This combination is shown to yield improved performance and better
generalization in procedurally generated environments, outperforming previously
reported self-imitation learning methods and achieving equal or better sample
efficiency than intrinsic motivation in isolation.
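
The two ingredients combined above can be sketched in a few lines. This is an illustrative sketch only: the abstract does not specify the intrinsic reward or the prioritization rule, so the count-based bonus in `CountBonus` and the return-ranked `TopKEpisodeBuffer` are assumptions, and both class names are hypothetical.

```python
import heapq
import math
from collections import Counter
from typing import Any, Hashable, List, Tuple


class CountBonus:
    """Count-based intrinsic reward: scale / sqrt(visit count) (assumption)."""

    def __init__(self, scale: float = 1.0) -> None:
        self.counts: Counter = Counter()
        self.scale = scale

    def __call__(self, state: Hashable) -> float:
        # Rarely visited states yield a larger bonus, encouraging exploration.
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


class TopKEpisodeBuffer:
    """Keeps the k highest-return episodes for later self-imitation."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._heap: List[Tuple[float, int, Any]] = []  # min-heap on return
        self._next_id = 0  # tie-breaker so episodes are never compared

    def add(self, episode_return: float, episode: Any) -> None:
        item = (episode_return, self._next_id, episode)
        self._next_id += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        else:
            # Push the new episode, then drop the lowest-return one.
            heapq.heappushpop(self._heap, item)

    def returns(self) -> List[float]:
        return sorted((r for r, _, _ in self._heap), reverse=True)


bonus = CountBonus()
first_visit = bonus("s0")
second_visit = bonus("s0")

buffer = TopKEpisodeBuffer(k=2)
for ret, name in [(1.0, "ep-a"), (3.0, "ep-b"), (2.0, "ep-c")]:
    buffer.add(ret, name)
kept_returns = buffer.returns()
```

In a full agent, the bonus would be added to the environment reward during exploration, while episodes kept in the buffer would be replayed as imitation targets. How the two signals are weighted against each other is left open here.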

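This paper proposes combining intrinsic motivation with imitation learning to improve exploration, addressing the challenges that overly sparse reward signals pose in a broad range of problems. It shows that, in procedurally generated environments, the method achieves improved performance and better generalization with equal or better sample efficiency.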

Towards Improving Exploration in Self-Imitation Learning using Intrinsic Motivation

We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in
the context of Reinforcement Learning (RL). SAC-X enables learning of complex
behaviors from scratch in the presence of multiple sparse reward signals.
To this end, the agent is equipped with a set of general auxiliary tasks that
it attempts to learn simultaneously via off-policy RL. The key idea behind our
method is that active (learned) scheduling and execution of auxiliary policies
allow the agent to efficiently explore its environment, enabling it to excel
at sparse reward RL. Our experiments in several challenging robotic
manipulation settings demonstrate the power of our approach.
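
The notion of a learned scheduler over auxiliary tasks can be illustrated with a small sketch. It is not the SAC-X algorithm as published: the `TaskScheduler` class, the running-mean value estimate, the softmax selection rule, and the task names are assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Sequence


class TaskScheduler:
    """Picks which task policy to execute next (hypothetical sketch).

    Each task keeps a running mean of the main-task return observed after
    executing it; tasks are sampled from a softmax over these means.
    """

    def __init__(self, tasks: Sequence[str], temperature: float = 1.0,
                 seed: int = 0) -> None:
        self.tasks: List[str] = list(tasks)
        self.temperature = temperature
        self.value: Dict[str, float] = {t: 0.0 for t in self.tasks}
        self.visits: Dict[str, int] = {t: 0 for t in self.tasks}
        self.rng = random.Random(seed)

    def probabilities(self) -> Dict[str, float]:
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(self.value.values())
        weights = [math.exp((self.value[t] - top) / self.temperature)
                   for t in self.tasks]
        total = sum(weights)
        return {t: w / total for t, w in zip(self.tasks, weights)}

    def choose(self) -> str:
        probs = self.probabilities()
        return self.rng.choices(
            self.tasks, weights=[probs[t] for t in self.tasks], k=1)[0]

    def update(self, task: str, main_task_return: float) -> None:
        # Incremental running mean of the return credited to this task.
        self.visits[task] += 1
        self.value[task] += (main_task_return - self.value[task]) / self.visits[task]


# Hypothetical auxiliary tasks for a manipulation setting.
scheduler = TaskScheduler(["reach", "grasp", "lift"])
scheduler.update("grasp", 1.0)  # executing "grasp" led to main-task reward
probs = scheduler.probabilities()
next_task = scheduler.choose()
```

After the single update, "grasp" is sampled more often than the other tasks, while every task keeps a nonzero probability, so the agent continues to try the remaining auxiliary policies.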

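This paper introduces Scheduled Auxiliary Control (SAC-X), a new reinforcement learning paradigm that can learn complex behaviors from scratch in the presence of multiple sparse reward signals, and demonstrates it experimentally in challenging robotic manipulation settings.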