Inverse Reinforcement Learning (IRL) is a powerful framework for learning
complex behaviors from expert demonstrations. However, it traditionally
requires repeatedly solving a computationally expensive reinforcement learning
(RL) problem in its inner loop. It is desirable to reduce the exploration
burden by leveraging expert demonstrations in the inner-loop RL. As an example,
recent work resets the learner to expert states in order to inform the learner
of high-reward expert states. However, such an approach is infeasible in the
real world. In this work, we consider an alternative approach to speeding up
the RL subroutine in IRL: \emph{pessimism}, i.e., staying close to the expert's
data distribution, instantiated via the use of offline RL algorithms. We
formalize a connection between offline RL and IRL, enabling us to use an
arbitrary offline RL algorithm to improve the sample efficiency of IRL. We
validate our theory experimentally by demonstrating a strong correlation
between the efficacy of an offline RL algorithm and how well it works as part
of an IRL procedure. By using a strong offline RL algorithm as part of an IRL
procedure, we are able to find policies that match expert performance
significantly more efficiently than the prior art.
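
To make the idea of pessimism concrete, the following is a minimal sketch of one common way to keep a learner close to the expert's data distribution: subtracting a count-based penalty from value estimates, so that state-action pairs rarely seen in the expert data look less attractive. This is an illustrative assumption, not the offline RL algorithm used in the work above; the names `pessimistic_q`, `greedy_action`, and `beta` are hypothetical.

```python
import math
from collections import Counter


def pessimistic_q(q_values, expert_counts, beta):
    """Subtract a count-based penalty from each value estimate.

    q_values: dict mapping (state, action) -> estimated value under the
        current learned reward (hypothetical input for this sketch).
    expert_counts: Counter of (state, action) pairs in the expert data.
    beta: penalty scale; beta = 0 recovers the unpenalized values.
    """
    return {
        sa: q - beta / math.sqrt(expert_counts[sa] + 1)
        for sa, q in q_values.items()
    }


def greedy_action(q_values, state):
    """Pick the action with the highest value in the given state."""
    candidates = {a: q for (s, a), q in q_values.items() if s == state}
    return max(candidates, key=candidates.get)


# Toy data (made up for illustration): the expert only ever goes left.
expert_counts = Counter([("s0", "left"), ("s0", "left"), ("s0", "left")])
q = {("s0", "left"): 1.0, ("s0", "right"): 1.2}

print(greedy_action(q, "s0"))                                     # right
print(greedy_action(pessimistic_q(q, expert_counts, 1.0), "s0"))  # left
```

Without the penalty, the learner prefers the unexplored action because of a slightly higher estimate; with it, the learner stays on the action the expert data supports.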

By using an offline RL algorithm as part of the IRL procedure, we are able to
find policies that match expert performance more efficiently.

The Virtues of Pessimism in Inverse Reinforcement Learning

Inverse reinforcement learning (IRL) is computationally challenging, with
common approaches requiring the solution of multiple reinforcement learning
(RL) sub-problems. This work motivates the use of potential-based reward
shaping to reduce the computational burden of each RL sub-problem. It serves
as a proof of concept that we hope will inspire future developments toward
computationally efficient IRL.
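
As a small illustration of potential-based reward shaping, the sketch below applies the standard shaping form r'(s, a, s') = r(s, a, s') + gamma * Phi(s') - Phi(s) along one trajectory. The abstract does not say which potential Phi is used, so the potential values here are arbitrary, the terminal potential is assumed to be zero, and the function names are hypothetical.

```python
def shaped_rewards(rewards, potentials, gamma):
    """Apply potential-based shaping along one trajectory.

    rewards: list of rewards r_0, ..., r_{T-1}.
    potentials: list of Phi(s_0), ..., Phi(s_T), one more than rewards.
    The potential of the final (terminal) state is treated as zero.
    """
    shaped = []
    last = len(rewards) - 1
    for t, r in enumerate(rewards):
        phi_next = 0.0 if t == last else potentials[t + 1]
        shaped.append(r + gamma * phi_next - potentials[t])
    return shaped


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Toy numbers, made up for illustration.
rewards = [1.0, 0.0, 2.0]
potentials = [0.5, 1.5, -1.0, 3.0]
gamma = 0.9

original = discounted_return(rewards, gamma)
shaped = discounted_return(shaped_rewards(rewards, potentials, gamma), gamma)

# The shaping terms telescope, so the two returns differ only by Phi(s_0).
print(round(original - shaped, 6))  # 0.5
```

Because the difference depends only on the start state, every policy's return shifts by the same amount, which is why this form of shaping leaves the ranking of policies unchanged while it can still make each RL sub-problem easier to solve.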

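Inverse reinforcement learning is computationally challenging, and common
approaches require solving multiple reinforcement learning sub-problems. This
work motivates the use of potential-based reward shaping to reduce the
computational burden of each sub-problem, and aims to inspire future work on
computationally efficient IRL.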