In this paper, we propose a novel method for learning reward functions directly from offline demonstrations. Unlike traditional inverse reinforcement learning (IRL), our approach decouples the reward function from the learner's policy, eliminating the adversarial interaction typically required between the two. This results in a more stable and efficient training process. Our reward function, called \textit{SR-Reward}, leverages the successor representation (SR) to encode each state by the expected visitation of future states under the demonstration policy and transition dynamics. Because the SR satisfies a Bellman equation, SR-Reward can be learned concurrently with most reinforcement learning (RL) algorithms without altering the existing training pipeline. We also introduce a negative sampling strategy that mitigates overestimation errors by reducing rewards for out-of-distribution data, thereby enhancing robustness. This strategy inherently introduces a conservative bias into RL algorithms that employ the learned reward. We evaluate our method on the D4RL benchmark, achieving results competitive with offline RL algorithms that have access to true rewards and with imitation learning (IL) techniques such as behavioral cloning. Moreover, our ablation studies on data size and quality reveal the advantages and limitations of SR-Reward as a proxy for true rewards.

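This paper proposes a new method for learning reward functions directly from offline demonstrations, addressing the adversarial interaction between the reward function and the learner's policy found in traditional inverse reinforcement learning. The method uses the successor representation (SR) to encode states and learns the reward function through the Bellman equation, so it can be trained in parallel with reinforcement learning algorithms. It achieves results competitive with offline reinforcement learning algorithms that use true rewards and with imitation learning methods, while demonstrating the advantages of SR-Reward in stability and efficiency.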
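To make the Bellman-style update of the successor representation concrete, here is a minimal sketch on a toy problem. It is an illustration under stated assumptions, not the paper's implementation: it assumes a small tabular state space, and the function name `learn_tabular_sr`, the three-state chain, and the hyperparameters are hypothetical. How a scalar reward is read off from the SR, and how negative sampling is applied, are not shown here.

```python
def learn_tabular_sr(transitions, n_states, gamma=0.9, lr=0.1, epochs=200):
    """TD(0) estimate of a tabular successor representation (SR).

    transitions: list of (state, next_state, done) tuples taken from
    offline demonstrations (hypothetical format, for illustration only).
    Returns M, where M[s][j] estimates the expected discounted number of
    visits to state j when starting in s and following the demonstrations.
    """
    M = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(epochs):
        for s, s_next, done in transitions:
            for j in range(n_states):
                # Bellman target: visit indicator now, plus discounted SR of the next state.
                indicator = 1.0 if j == s else 0.0
                bootstrap = 0.0 if done else gamma * M[s_next][j]
                target = indicator + bootstrap
                M[s][j] += lr * (target - M[s][j])
    return M


# Toy demonstration: a deterministic chain 0 -> 1 -> 2, with the episode ending at state 2.
demo = [(0, 1, False), (1, 2, False), (2, 2, True)]
sr = learn_tabular_sr(demo, n_states=3)
```

In this toy chain, the row for state 0 approaches the discounted visitation of states 0, 1, and 2 along the demonstrated path, while states the demonstrations never reach from a given start keep an entry of zero. The update only needs demonstration transitions, which is why it can run alongside an RL algorithm's own updates.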

SR-Reward: Taking the Path More Traveled