One-shot imitation aims to learn a new task from a single demonstration, yet applying it to complex tasks remains challenging because of the high domain diversity inherent in non-stationary environments. To tackle this problem, we explore the compositionality of complex tasks and present a novel skill-based imitation learning framework that enables one-shot imitation and zero-shot adaptation: from a single demonstration of a complex unseen task, a semantic skill sequence is inferred, and each skill in the sequence is then converted into an action sequence optimized for hidden environment dynamics that can vary over time. Specifically, we leverage a vision-language model to learn a semantic skill set from offline video datasets, where each skill is represented in the vision-language embedding space, and adapt meta-learning with dynamics inference to enable zero-shot skill adaptation. We evaluate our framework on various one-shot imitation scenarios for extended multi-stage Meta-world tasks, showing that it outperforms other baselines in learning complex tasks, generalizing to dynamics changes, and extending to different demonstration conditions and modalities.
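
As a rough illustration of the first stage described above (inferring a semantic skill sequence from a demonstration in a shared embedding space), the sketch below labels each demonstration frame with its nearest skill by cosine similarity and merges consecutive repeats. It is not the framework's actual procedure: the function name `infer_skill_sequence`, the skill names, and the toy 3-dimensional embeddings are all hypothetical stand-ins for the output of a vision-language encoder.

```python
import numpy as np


def infer_skill_sequence(frame_embeddings, skill_embeddings, skill_names):
    """Label each frame with its nearest skill, then merge consecutive repeats.

    Hypothetical helper: nearest-neighbour matching by cosine similarity is an
    assumption made for illustration, not the method from the abstract.
    """
    # Normalise rows so that the dot product equals cosine similarity.
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    skills = skill_embeddings / np.linalg.norm(skill_embeddings, axis=1, keepdims=True)

    # For every frame, pick the index of the most similar skill embedding.
    nearest = np.argmax(frames @ skills.T, axis=1)

    # Collapse runs of the same skill into a single entry.
    sequence = []
    for idx in nearest:
        name = skill_names[idx]
        if not sequence or sequence[-1] != name:
            sequence.append(name)
    return sequence


# Toy data (assumed): three skills and four demonstration frames.
skill_names = ["open drawer", "push button", "close door"]
skill_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
frame_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.7],
])

demo_skills = infer_skill_sequence(frame_embeddings, skill_embeddings, skill_names)
print(demo_skills)
```

In the framework itself, each inferred skill would then be passed to a policy conditioned on inferred dynamics to produce actions; that second stage is omitted here because the abstract does not specify it in enough detail to sketch.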

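By exploring the compositionality of complex tasks, we propose a novel skill-based imitation learning framework that achieves one-shot imitation and zero-shot adaptation. It learns a complex task from a single demonstration and optimizes action sequences for hidden environment dynamics that vary over time, using a vision-language model to learn a semantic skill set and dynamics inference to enable zero-shot skill adaptation. We evaluate the framework on multiple one-shot imitation scenarios, showing that it outperforms other baseline models in learning complex tasks, generalizing to dynamics changes, and handling different demonstration conditions and modalities.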

One-Shot Imitation of Multi-Modal Skills in Non-Stationary Environments