Offline reinforcement learning (RL) allows learning sequential behavior from
fixed datasets. Since offline datasets do not cover all possible situations,
many methods collect additional data during online fine-tuning to improve
performance. In general, these methods assume that the transition dynamics
remain the same during both the offline and online phases of training. However,
in many real-world applications, such as outdoor construction and navigation
over rough terrain, it is common for the transition dynamics to vary between
the offline and online phases. Moreover, the dynamics may vary during online
fine-tuning. To address this problem of changing dynamics from offline to
online RL, we propose a residual learning approach that infers dynamics changes
to correct the outputs of the offline solution. During the online fine-tuning
phase, we train a context encoder to learn a representation that is consistent
within the current online learning environment while being able to predict the
transition dynamics. Experiments in D4RL MuJoCo environments, modified to
support dynamics changes upon environment resets, show that our approach can
adapt to these dynamics changes and generalize to unseen perturbations in a
sample-efficient way, whereas the comparison methods cannot.
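
The residual idea described above can be illustrated with a minimal sketch.
Everything in it is an assumption made for illustration and is not taken from
the paper: the names `mean_context` and `residual_action`, the averaging
"encoder" that stands in for a learned context encoder, and the additive,
clipped form of the correction.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_context(transitions: Sequence[Sequence[float]]) -> Vector:
    """Toy stand-in for a context encoder (assumption): average the feature
    vectors of recent transitions into a single context vector."""
    if not transitions:
        raise ValueError("need at least one transition")
    dim = len(transitions[0])
    return [sum(t[i] for t in transitions) / len(transitions) for i in range(dim)]


def residual_action(
    base_policy: Callable[[Vector], Vector],
    residual_policy: Callable[[Vector, Vector], Vector],
    state: Vector,
    context: Vector,
    low: float = -1.0,
    high: float = 1.0,
) -> Vector:
    """Correct the frozen offline policy's action with a context-conditioned
    residual, then clip to the action bounds (hypothetical formulation)."""
    base = base_policy(state)
    delta = residual_policy(state, context)
    return [min(high, max(low, b + d)) for b, d in zip(base, delta)]


# Hypothetical usage with hand-written policies in place of trained networks.
offline_policy = lambda s: [0.5 * s[0], 0.75]
residual = lambda s, z: [-z[0], 2.0 * z[1]]
context = mean_context([[0.5, 0.25], [0.0, 0.25]])
action = residual_action(offline_policy, residual, [1.0], context)
```

In this sketch the offline policy is left untouched and only the residual
depends on the inferred context, so a change in dynamics would alter the
correction rather than the offline solution itself.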

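Offline reinforcement learning learns sequential behavior from offline datasets, but in practical applications the transition dynamics often change between the offline and online phases. This work therefore proposes a residual learning approach that infers dynamics changes to correct the outputs of the offline solution. During online fine-tuning, a context encoder is trained to learn a representation that is consistent within the current online learning environment and can predict the transition dynamics. Experiments show that the method adapts to these dynamics changes and generalizes to unseen perturbations in a sample-efficient way.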

Residual Learning and Context Encoding for Adaptive Offline-to-Online Reinforcement Learning

One-shot imitation aims to learn a new task from a single demonstration, yet it
is challenging to apply to complex tasks with the high domain diversity
inherent in a non-stationary environment. To tackle this problem, we
explore the compositionality of complex tasks, and present a novel skill-based
imitation learning framework enabling one-shot imitation and zero-shot
adaptation: from a single demonstration for a complex unseen task, a semantic
skill sequence is inferred, and then each skill in the sequence is converted
into an action sequence optimized for hidden environment dynamics that can
vary over time. Specifically, we leverage a vision-language model to learn a
semantic skill set from offline video datasets, where each skill is represented
on the vision-language embedding space, and adapt meta-learning with dynamics
inference to enable zero-shot skill adaptation. We evaluate our framework on
various one-shot imitation scenarios for extended multi-stage Meta-world tasks,
showing that it outperforms other baselines in learning complex tasks,
generalizing to dynamics changes, and extending to different demonstration
conditions and modalities.
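
One way to picture the step of inferring a semantic skill sequence from a
single demonstration is a nearest-neighbor match in an embedding space. The
sketch below is an assumption made for illustration, not the paper's method:
the function names, the skill names, the use of cosine similarity, and the toy
two-dimensional vectors that stand in for vision-language embeddings are all
hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def infer_skill_sequence(
    segment_embeddings: Sequence[Sequence[float]],
    skill_embeddings: Dict[str, Sequence[float]],
) -> List[str]:
    """Label each demonstration segment with its most similar skill and merge
    consecutive repeats into one entry (hypothetical procedure)."""
    sequence: List[str] = []
    for segment in segment_embeddings:
        best = max(
            skill_embeddings,
            key=lambda name: cosine(segment, skill_embeddings[name]),
        )
        if not sequence or sequence[-1] != best:
            sequence.append(best)
    return sequence


# Hypothetical skill set and demonstration segments.
skills = {"push": [1.0, 0.0], "open": [0.0, 1.0]}
demo_segments = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
skill_sequence = infer_skill_sequence(demo_segments, skills)
```

Under these assumptions, each inferred skill would then be handed to a
separate, dynamics-conditioned policy that produces the actual action
sequence; that stage is not shown here.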

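By exploring the compositionality of complex tasks, we propose a novel skill-based imitation learning framework that enables one-shot imitation and zero-shot adaptation: it learns a complex task from a single demonstration and optimizes action sequences for hidden environment dynamics that vary over time. A vision-language model is used to learn a semantic skill set, and dynamics inference enables zero-shot skill adaptation. We evaluate the framework on multiple one-shot imitation scenarios, demonstrating its superiority over other baselines in learning complex tasks, generalizing to dynamics changes, and handling different demonstration conditions and modalities.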