Offline reinforcement learning (RL) aims to learn an optimal policy from a batch of collected data, without further interaction with the environment during training. Because it avoids hazardous executions in the environment, offline RL can greatly broaden the scope of RL applications. However, current offline RL benchmarks commonly have a large reality gap: they involve large datasets collected by highly exploratory policies, and a trained policy is evaluated directly in the environment. In real-world situations, by contrast, running a highly exploratory policy is prohibited to ensure system safety, data is commonly very limited, and a trained policy should be well validated before deployment. In this paper, we present NeoRL, a suite of near real-world benchmarks. NeoRL contains datasets from various domains with controlled sizes, plus extra test datasets for policy validation. We then evaluate existing offline RL algorithms on NeoRL. In the experiments, we argue that the performance of a policy should also be compared with the deterministic version of the behavior policy rather than only with the dataset reward, because the deterministic behavior policy is the baseline in real scenarios, while the dataset is often collected with action perturbations that can degrade performance. The empirical results demonstrate that the tested offline RL algorithms are merely competitive with this deterministic policy on many datasets, and that offline policy evaluation hardly helps. The NeoRL suite can be found at http://polixir.ai/research/newrl. We hope this work will shed some light on future research and draw more attention to the challenges of deploying RL in real-world systems.
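
The gap between the dataset reward and the deterministic behavior policy can be illustrated with a minimal sketch. Everything here is a hypothetical toy setup and not part of the benchmark: the 1-D dynamics, the proportional controller standing in for the behavior policy, the noise level, and the function name `rollout` are all assumptions chosen for illustration.

```python
import random


def rollout(noise_std, steps=50, seed=0):
    """Episode return of a toy proportional controller on a 1-D task.

    noise_std > 0 mimics the action perturbation used while collecting data;
    noise_std == 0 gives the deterministic version of the same policy.
    """
    rng = random.Random(seed)
    x = 1.0                                   # initial state (assumed)
    total = 0.0
    for _ in range(steps):
        action = -0.5 * x                     # deterministic behavior policy (assumed)
        action += rng.gauss(0.0, noise_std)   # perturbation added during data collection
        x = x + action                        # toy dynamics (assumed)
        total += -x * x                       # reward: stay close to the origin
    return total


# Return of the deterministic behavior policy.
deterministic_return = rollout(noise_std=0.0)

# Average return of the perturbed policy, i.e. what the dataset reward reflects.
dataset_return = sum(rollout(noise_std=0.3, seed=s) for s in range(100)) / 100

print(f"deterministic behavior policy: {deterministic_return:.3f}")
print(f"average dataset return:        {dataset_return:.3f}")
```

In this toy setting the perturbed rollouts score lower than the deterministic policy, so a learned policy that merely beats the dataset reward may still fall short of the baseline that would actually be running in a real system.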

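This paper proposes NeoRL, a near real-world benchmark for offline reinforcement learning, evaluates existing offline RL algorithms on it, and argues that a policy's performance should be compared with the deterministic version of the behavior policy, in order to support the validation and deployment of RL in real-world applications.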

NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning