In this paper, we study offline-to-online Imitation Learning (IL) that
pretrains an imitation policy from static demonstration data, followed by fast
finetuning with minimal environmental interaction. We find that the na\"ive
combination of existing offline IL and online IL methods tends to behave poorly
in this context, because the initial discriminator (often used in online IL)
operates randomly and discordantly against the policy initialization, leading
to misguided policy optimization and $\textit{unlearning}$ of pretraining
knowledge. To overcome this challenge, we propose a principled
offline-to-online IL method, named $\texttt{OLLIE}$, that simultaneously learns
a near-expert policy initialization along with an $\textit{aligned
discriminator initialization}$, which can be seamlessly integrated into online
IL, achieving smooth and fast finetuning. Empirically, $\texttt{OLLIE}$
consistently and significantly outperforms the baseline methods in
$\textbf{20}$ challenging tasks, from continuous control to vision-based
domains, in terms of performance, demonstration efficiency, and convergence
speed. This work may serve as a foundation for further exploration of
pretraining and finetuning in the context of IL.

OLLIE: Imitation Learning from Offline Pretraining to Online Finetuning

It is common in deep learning to train networks on auxiliary tasks with the
expectation that the learning will transfer, at least partially, to another
task of interest. In this work, we investigate the inductive biases that result
from learning auxiliary tasks, either simultaneously (multi-task learning, MTL)
or sequentially (pretraining and subsequent finetuning, PT+FT). In the
simplified setting of two-layer diagonal linear networks trained with gradient
descent, we identify implicit regularization penalties associated with MTL and
PT+FT, both of which incentivize feature sharing between tasks and sparsity in
learned task-specific features. Notably, our results imply that during
finetuning, networks operate in a hybrid of the kernel (or "lazy") regime and
the feature learning ("rich") regime identified in prior work. Moreover, PT+FT
can exhibit a novel "nested feature learning" behavior not captured by either
regime, which biases it to extract a sparse subset of the features learned
during pretraining. In ReLU networks, we reproduce all of these qualitative
behaviors. We also observe that PT+FT (but not MTL) is biased to learn features
that are correlated with (but distinct from) those needed for the auxiliary
task, while MTL is biased toward using identical features for both tasks. As a
result, we find that in realistic settings, MTL generalizes better when
comparatively little data is available for the task of interest, while PT+FT
outperforms it with more data available. We show that our findings hold
qualitatively for a deep architecture trained on image classification tasks.
Our characterization of the nested feature learning regime also motivates a
modification to PT+FT that we find empirically improves performance. Overall,
our results shed light on the impact of auxiliary task learning and suggest
ways to leverage it more effectively.
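
To make the kernel ("lazy") versus feature learning ("rich") distinction concrete, here is a minimal single-task sketch of a two-layer diagonal linear network trained with gradient descent. It does not reproduce the MTL or PT+FT analysis described above. The parametrization beta = u*u - v*v, the constant initialization scale alpha, the data dimensions, the learning rate, and the names train_diagonal_net and run_demo are all assumptions chosen for illustration. The sketch only shows the well-known tendency of a small initialization to favor sparse (low L1 norm) interpolating solutions, while a large initialization behaves more like minimum L2 norm regression.

```python
import numpy as np


def train_diagonal_net(X, y, alpha, lr=0.002, steps=30000):
    """Gradient descent on a two-layer diagonal linear network.

    The effective regressor is beta = u*u - v*v (an assumed, commonly used
    parametrization); both weight vectors start at the constant alpha,
    so beta is exactly zero at initialization.
    """
    n, d = X.shape
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        beta = u * u - v * v
        # Gradient of 0.5 * mean squared error with respect to beta.
        grad_beta = X.T @ (X @ beta - y) / n
        # Chain rule: d(beta)/du = 2u and d(beta)/dv = -2v (simultaneous update).
        u, v = u - lr * 2 * u * grad_beta, v + lr * 2 * v * grad_beta
    return u * u - v * v


def run_demo(seed=0, n=40, d=100):
    """Fit an underdetermined sparse regression problem at two init scales."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta_true = np.zeros(d)
    beta_true[:3] = [1.0, -1.0, 1.0]  # hypothetical sparse ground truth
    y = X @ beta_true
    results = {}
    # Small alpha: "rich" regime. Large alpha: "lazy" (kernel-like) regime.
    for name, alpha in [("rich", 1e-3), ("lazy", 2.0)]:
        beta = train_diagonal_net(X, y, alpha)
        results[name] = {
            "train_loss": 0.5 * np.mean((X @ beta - y) ** 2),
            "l1_norm": np.abs(beta).sum(),
        }
    return results


if __name__ == "__main__":
    for name, stats in run_demo().items():
        print(name, stats)
```

Under these assumptions, both runs should fit the training data, and the small-initialization run would be expected to end with a noticeably smaller L1 norm than the large-initialization run. The exact numbers depend on the seed, the step size, and the number of steps.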

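By studying auxiliary task learning, we find that it incentivizes feature sharing between tasks and sparsity in task-specific features, and we propose a modification to pretraining plus finetuning that improves performance.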