Existing imitation learning (IL) methods such as inverse reinforcement
learning (IRL) typically follow a double-loop training procedure, alternating
between learning a reward function and a policy, and tend to suffer from long
training times and high variance. In this work, we identify the benefits