Behavior constrained policy optimization has been demonstrated to be a
successful paradigm for tackling Offline Reinforcement Learning. By exploiting
historical transitions, a policy is trained to maximize a learned value
function while constrained by the behavior policy to avoid a significant
distributional shift. In this paper, we propose closed-form policy
improvement operators. We make a novel observation that the behavior constraint
naturally motivates the use of first-order Taylor approximation, leading to a
linear approximation of the policy objective. Additionally, as practical
datasets are usually collected by heterogeneous policies, we model the behavior
policies as a Gaussian Mixture and overcome the induced optimization
difficulties by leveraging the LogSumExp's lower bound and Jensen's Inequality,
giving rise to a closed-form policy improvement operator. We instantiate
offline RL algorithms with our novel policy improvement operators and
empirically demonstrate their effectiveness over state-of-the-art algorithms on
the standard D4RL benchmark.
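
To illustrate why linearizing the objective yields a closed form, the sketch below treats the simplest case: a single Gaussian behavior policy N(mu, Sigma) rather than the mixture discussed above. It assumes the behavior constraint can be written as a bound on the Mahalanobis distance, (a - mu)^T Sigma^{-1} (a - mu) <= 2 * delta, and that the value function is replaced by its first-order Taylor expansion around mu. The function name `closed_form_improvement`, the budget `delta`, and this exact constraint form are illustrative assumptions, not the operators proposed in the paper.

```python
import numpy as np

def closed_form_improvement(mu, sigma, grad_q, delta):
    """Hypothetical single-Gaussian sketch (not the paper's exact operator).

    Solves  max_a  grad_q^T (a - mu)
            s.t.   (a - mu)^T sigma^{-1} (a - mu) <= 2 * delta,
    i.e. a linearized value objective under an ellipsoidal behavior constraint.
    The maximizer is  mu + sqrt(2 * delta) * sigma @ g / sqrt(g^T sigma g).
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    g = np.asarray(grad_q, dtype=float)
    sg = sigma @ g
    norm = np.sqrt(g @ sg)
    if norm == 0.0:
        # Zero gradient: the linearized objective is flat, so stay at the mean.
        return mu.copy()
    return mu + np.sqrt(2.0 * delta) * sg / norm

# Usage: the improved action moves from the behavior mean along sigma @ grad_q
# until it reaches the boundary of the constraint set.
mu = np.array([0.0, 0.0])
sigma = np.diag([1.0, 4.0])
action = closed_form_improvement(mu, sigma, grad_q=[1.0, 0.0], delta=0.5)
```

Because the objective is linear and the constraint set is an ellipsoid, the maximizer lies on the boundary and no iterative optimization is needed. The mixture case described in the abstract is harder, since the log-density of a Gaussian mixture is no longer quadratic in the action.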
