Recently, reward-conditioned reinforcement learning (RCRL) has gained
popularity due to its simplicity, flexibility, and off-policy nature. However,
we show that current RCRL approaches are fundamentally limited and fail to
address two critical challenges of RCRL: improving generalization on high
reward-to-go (RTG) inputs, and avoiding out-of-distribution (OOD) RTG queries
at test time. To address these challenges when training vanilla RCRL
architectures, we propose Bayesian Reparameterized RCRL (BR-RCRL), a novel set
of inductive biases for RCRL inspired by Bayes' theorem. BR-RCRL removes a core
obstacle preventing vanilla RCRL from generalizing on high RTG inputs: the
model's tendency to treat different RTG inputs as independent values,
which we term ``RTG Independence''. BR-RCRL also allows us to design an
accompanying adaptive inference method, which maximizes total returns while
avoiding OOD queries that yield unpredictable behaviors in vanilla RCRL
methods. We show that BR-RCRL achieves state-of-the-art performance on the
Gym-Mujoco and Atari offline RL benchmarks, improving upon vanilla RCRL by up
to 11%.

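Summary: The paper proposes Bayesian Reparameterized RCRL (BR-RCRL), a new approach to reward-conditioned reinforcement learning. By removing the RTG independence bias that hinders generalization on high-reward inputs and by avoiding out-of-distribution RTG queries at inference, it improves on vanilla RCRL by up to 11% on the Gym-Mujoco and Atari offline RL benchmarks.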
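
To make the notions of a reward-to-go (RTG) input and an OOD RTG query concrete, the following is a minimal sketch in plain Python. It is not the BR-RCRL adaptive inference method, which the abstract does not specify. It only illustrates a naive baseline that clamps a requested RTG to the range observed in an offline dataset. The function names `returns_to_go` and `clamp_rtg_query` are hypothetical and chosen for illustration.

```python
def returns_to_go(rewards):
    """Compute the undiscounted return-to-go at every step of one episode.

    rtg[t] is the sum of rewards from step t to the end of the episode.
    """
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def clamp_rtg_query(target_rtg, dataset_rtgs):
    """Clamp a requested RTG to the range seen in the offline dataset.

    Assumption: this naive clamp stands in for "avoiding OOD RTG queries";
    it is an illustrative baseline, not the method proposed in the paper.
    """
    lowest = min(dataset_rtgs)
    highest = max(dataset_rtgs)
    return max(lowest, min(target_rtg, highest))


# Toy offline dataset with two short episodes (rewards are made up).
episodes = [[1.0, 2.0, 3.0], [0.0, 1.0]]
dataset_rtgs = [g for ep in episodes for g in returns_to_go(ep)]

# A query far above anything in the data is out of distribution,
# so it is pulled back to the largest RTG actually observed.
safe_query = clamp_rtg_query(100.0, dataset_rtgs)
```

A reward-conditioned policy would then be queried with `safe_query` rather than the raw target. The trade-off the abstract points at is visible even here: a higher RTG asks for better behavior, but a value beyond the data's support gives the model an input it was never trained on.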