Adversarial inverse reinforcement learning (AIRL) is a cornerstone approach in imitation learning. This paper rethinks AIRL from two angles: policy imitation and transferable reward recovery. We begin by replacing the built-in policy optimization algorithm in AIRL with soft actor-critic (SAC) to improve sample efficiency, taking advantage of the off-policy formulation of SAC and of Markov decision process (MDP) models that are identifiable with respect to AIRL. This substitution significantly improves policy imitation but inadvertently harms transferable reward recovery. To investigate this issue, we show that SAC by itself cannot comprehensively disentangle the reward function during AIRL training, and we propose a hybrid framework, PPO-AIRL + SAC, that achieves a satisfactory transfer effect. Additionally, we analyze the capability of environments to extract disentangled rewards from the perspective of algebraic theory.

Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery