Offline Reinforcement Learning (RL) aims at learning an optimal control from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it for exploration. This allows the policy to stay close to the support of the dataset. We connect this approach to a more common regularization of the learned policy towards the data. Instantiated with a bonus based on the prediction error of a variational autoencoder, we show that our agent is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.
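
The core idea above, subtracting a bonus from the reward rather than adding it, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the scale `alpha`, the toy dataset, and above all the bonus itself, which is a nearest-neighbor distance standing in for the variational autoencoder prediction error used in the paper.

```python
import math


def nearest_neighbor_bonus(dataset, state_action):
    """Stand-in bonus: Euclidean distance to the closest dataset point.

    This is NOT the VAE prediction error from the abstract. It only mimics
    the property that the bonus is small on the data and large away from it.
    """
    return min(math.dist(state_action, point) for point in dataset)


def anti_exploration_reward(reward, bonus, alpha=1.0):
    """Subtract the scaled bonus from the reward (exploration would add it)."""
    return reward - alpha * bonus


# Hypothetical dataset of concatenated (state, action) vectors.
dataset = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

in_support = (1.0, 0.0)      # appears in the dataset, so its bonus is zero
out_of_support = (4.0, 4.0)  # far from every dataset point

for pair in (in_support, out_of_support):
    bonus = nearest_neighbor_bonus(dataset, pair)
    print(pair, anti_exploration_reward(reward=1.0, bonus=bonus, alpha=0.5))
```

In this toy setting the in-support pair keeps its original reward, while the out-of-support pair is penalized, which is what discourages a policy from leaving the support of the dataset.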

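This work proposes a new offline reinforcement learning agent that subtracts a prediction-based exploration bonus from the reward, so that the policy stays close to the support of the dataset. It connects this approach to the more common regularization of the learned policy towards the data. Instantiated with a bonus based on the prediction error of a variational autoencoder, the agent is shown to be competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.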

Offline Reinforcement Learning as Anti-Exploration