Recent methods for imitation learning directly learn a $Q$-function using an
implicit reward formulation rather than an explicit reward function. However,
these methods generally require implicit reward regularization to improve
stability and often mistreat absorbing states. Previous works show that a
squared norm regularization on the implicit reward function is effective, but
do not provide a theoretical analysis of the resulting properties of the
algorithms. In this work, we show that using this regularizer under a mixture
distribution of the policy and the expert provides a particularly illuminating
perspective: the original objective can be understood as squared Bellman error
minimization, and the corresponding optimization problem minimizes a bounded
$\chi^2$ divergence between the expert and the mixture distribution. This
perspective allows us to address instabilities and properly treat absorbing
states. We show that our method, Least Squares Inverse Q-Learning (LS-IQ),
outperforms state-of-the-art algorithms, particularly in environments with
absorbing states. Finally, we propose to use an inverse dynamics model to learn
from observations only. Using this approach, we retain performance in settings
where no expert actions are available.
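
To make the regularizer described above concrete, the following is a minimal sketch rather than the LS-IQ implementation. It assumes the common implicit reward form $r(s, a) = Q(s, a) - \gamma V(s')$ and an equally weighted mixture of expert and policy samples. The function names, the `expert_weight` parameter, and the plain-list inputs are hypothetical and chosen only for illustration; the treatment of absorbing states is deliberately left out.

```python
def implicit_rewards(q_values, next_values, gamma):
    """Implicit reward r = Q(s, a) - gamma * V(s') for each transition.

    Assumption: next_values already holds V(s') for every transition;
    how V is obtained from Q is not modeled in this sketch.
    """
    return [q - gamma * v for q, v in zip(q_values, next_values)]


def mixture_squared_reward_penalty(expert_rewards, policy_rewards,
                                   expert_weight=0.5):
    """Squared implicit reward averaged under an expert/policy mixture.

    Assumption: expert_weight is a hypothetical mixing coefficient;
    0.5 gives an equal mixture of the two sample sets.
    """
    expert_term = sum(r * r for r in expert_rewards) / len(expert_rewards)
    policy_term = sum(r * r for r in policy_rewards) / len(policy_rewards)
    return expert_weight * expert_term + (1.0 - expert_weight) * policy_term


# Toy transitions: Q(s, a) and V(s') for expert and policy samples.
expert_r = implicit_rewards([1.0, 2.0], [0.0, 2.0], gamma=0.5)
policy_r = implicit_rewards([0.0, 3.0], [0.0, 2.0], gamma=0.5)
penalty = mixture_squared_reward_penalty(expert_r, policy_r)
```

Because the penalty is evaluated on both expert and policy samples, it constrains the magnitude of the implicit reward everywhere the mixture places mass, not only on expert data.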

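This paper studies the use of a regularized implicit reward function to address absorbing states and instability. It proposes a new method, Least Squares Inverse Q-Learning (LS-IQ), which achieves the best performance in key domains, particularly in environments with absorbing states. We also propose using an inverse dynamics model to learn from observations alone.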