Recent progress in state-only imitation learning extends the applicability of imitation learning to real-world settings by removing the need to observe expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans toward the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy into a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO in learning a generalized target state planner while achieving the best imitation performance. We demonstrate the appeal of DePO for transferring across different tasks by pre-training, and its potential for co-training agents with various skills.

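This paper proposes DePO (Decoupled Policy Optimization), a method that decomposes the policy into a high-level state planner and an inverse dynamics model, and optimizes them with an embedded decoupled policy gradient and adversarial training. The method enables knowledge transfer across different action spaces or state transition dynamics, and it can generalize the planner to state regions outside the demonstrations. The experimental results indicate that DePO strengthens transferability and generalization without loss of accuracy, and that it effectively learns a generalized target state planner.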
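To make the decoupling concrete, the following is a minimal toy sketch, not the DePO implementation. It assumes a one-dimensional state, a hand-written planner, and closed-form inverse dynamics, whereas DePO learns both components. All names (`make_planner`, `make_decoupled_policy`, `step_a`, `step_b`, and so on) are hypothetical. It only illustrates why a single state planner can be reused when the action space changes: the planner proposes the next target state, and an environment-specific inverse dynamics model converts that target into an action.

```python
from typing import Callable, List

State = float
Action = float


def make_planner(goal: State, max_step: float) -> Callable[[State], State]:
    """Hypothetical state planner: proposes the next target state."""
    def planner(s: State) -> State:
        # Move toward the goal, but by at most max_step per decision.
        delta = max(-max_step, min(max_step, goal - s))
        return s + delta
    return planner


def make_decoupled_policy(
    planner: Callable[[State], State],
    inverse_dynamics: Callable[[State, State], Action],
) -> Callable[[State], Action]:
    """Compose the two parts: action = inverse_dynamics(s, planner(s))."""
    def policy(s: State) -> Action:
        target = planner(s)
        return inverse_dynamics(s, target)
    return policy


# Two toy environments with the same states but different action spaces
# (assumed for illustration only).
def step_a(s: State, a: Action) -> State:
    return s + a


def inv_a(s: State, s_next: State) -> Action:
    return s_next - s


def step_b(s: State, a: Action) -> State:
    return s + 0.5 * a  # actions have half the effect here


def inv_b(s: State, s_next: State) -> Action:
    return 2.0 * (s_next - s)


def rollout(policy, step, s0: State, n: int) -> List[State]:
    """Run the policy for n steps and record the visited states."""
    states = [s0]
    for _ in range(n):
        states.append(step(states[-1], policy(states[-1])))
    return states


# The same planner is shared; only the inverse dynamics model is swapped.
planner = make_planner(goal=1.0, max_step=0.25)
traj_a = rollout(make_decoupled_policy(planner, inv_a), step_a, 0.0, 5)
traj_b = rollout(make_decoupled_policy(planner, inv_b), step_b, 0.0, 5)
```

In this toy setting both rollouts visit the same states even though the actions differ by a factor of two, which is the kind of transfer across action spaces that the decoupled structure is meant to allow.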

Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization