Most prior approaches to offline reinforcement learning (RL) have taken an
iterative actor-critic approach involving off-policy evaluation. In this paper
we show that simply doing one step of constrained/regularized policy
improvement using an on-policy Q estimate of the behavior policy performs
surprisingly well. This one-step algorithm beats the previously reported
results of iterative algorithms on a large portion of the D4RL benchmark. The
one-step baseline achieves this strong performance while being notably simpler
and more robust to hyperparameters than previously proposed iterative
algorithms. We argue that the relatively poor performance of iterative
approaches is a result of the high variance inherent in doing off-policy
evaluation and magnified by the repeated optimization of policies against those
estimates. In addition, we hypothesize that the strong performance of the
one-step algorithm is due to a combination of favorable structure in the
environment and behavior policy.

本文探讨了离线强化学习领域中的一个策略改进方法，使用 on-policy Q 估计的行为策略，通过一步有限制 / 正则化的策略改进，能在 D4RL 基准测试中表现优于迭代算法。我们认为，迭代算法的性能较差是由于进行 off-policy 评估所固有的高方差以及相对较差的行为策略等原因所导致的。