Offline reinforcement learning (RL) defines a sample-efficient learning
paradigm, where a policy is learned from static and previously collected
datasets without additional interaction with the environment. The major
obstacle to offline RL is the estimation error arising from evaluating the
value of out-of-distribution actions. To tackle this problem, most existing
offline RL methods attempt to acquire a policy both ``close'' to the behaviors
contained in the dataset and sufficiently improved over them, which requires a
trade-off between two possibly conflicting targets. In this paper, we propose a
novel approach, which we refer to as adaptive behavior regularization (ABR), to
balance this critical trade-off. By simply using a sample-based
regularization, ABR enables the policy to adaptively adjust its optimization
objective between cloning and improving over the policy used to generate the
dataset. In evaluations on the D4RL datasets, a widely adopted benchmark for
offline reinforcement learning, ABR achieves improved or competitive
performance compared with existing state-of-the-art algorithms.

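Summary: This paper proposes adaptive behavior regularization (ABR), a method that mitigates the behavior sampling bias present in existing machine learning datasets, thereby improving the efficiency and stability of offline reinforcement learning, and it achieves better or comparable performance relative to recent algorithms on the D4RL datasets.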
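
As a rough illustration of the trade-off between cloning the dataset behavior and improving over it, the sketch below computes a generic behavior-regularized policy loss with a fixed weight. It is not the ABR objective, which the text above does not specify. The function name, the argument names, and the weight `alpha` are assumptions introduced only for illustration.

```python
# Illustrative sketch only: a generic fixed-weight behavior-regularized loss.
# This is NOT the ABR objective; all names here are hypothetical.

def regularized_policy_loss(q_values, policy_actions, dataset_actions, alpha):
    """Batch average of -Q(s, pi(s)) + alpha * ||pi(s) - a||^2.

    q_values: critic estimates Q(s, pi(s)), one float per sample
    policy_actions: actions proposed by the policy, one vector per sample
    dataset_actions: actions stored in the offline dataset, one vector per sample
    alpha: weight on the cloning term (assumed hyperparameter)
    """
    if not q_values:
        raise ValueError("batch must not be empty")
    if not (len(q_values) == len(policy_actions) == len(dataset_actions)):
        raise ValueError("batch sizes must match")

    total = 0.0
    for q, pi_a, data_a in zip(q_values, policy_actions, dataset_actions):
        # Squared distance to the dataset action: the "stay close" term.
        cloning_penalty = sum((p - d) ** 2 for p, d in zip(pi_a, data_a))
        # Negative Q: the "improve over the data" term.
        total += -q + alpha * cloning_penalty
    return total / len(q_values)


if __name__ == "__main__":
    q = [1.0, 2.0]
    pi = [[0.0], [1.0]]
    data = [[0.0], [0.0]]
    # alpha = 0 ignores the dataset; a larger alpha pulls toward cloning.
    print(regularized_policy_loss(q, pi, data, alpha=0.0))
    print(regularized_policy_loss(q, pi, data, alpha=2.0))
```

With `alpha` set to zero the loss depends only on the value estimates, and as `alpha` grows the cloning term dominates. A fixed weight like this has to be tuned by hand, which is the kind of balance that an adaptive scheme aims to avoid.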