Offline reinforcement learning aims to utilize datasets of previously
gathered environment-action interaction records to learn a policy without
access to the real environment. Recent work has shown that offline
reinforcement learning can be formulated as a sequence modeling problem and
solved via supervised learning with approaches such as Decision Transformer.
While these sequence-based methods achieve competitive results over
return-to-go methods, especially on tasks that require longer episodes or have
sparse rewards, they do not use importance sampling to correct the policy
bias when dealing with off-policy data, mainly due to the absence of a known
behavior policy and the use of deterministic evaluation policies. To this end, we
propose Double Policy Estimation (DPE), an RL algorithm that blends offline
sequence modeling and offline reinforcement learning in a unified framework
with statistically proven variance-reduction properties. We
validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks. Our
method improves the performance of selected methods and outperforms SOTA
baselines on several tasks, demonstrating the advantages of enabling
double policy estimation for sequence-modeled reinforcement learning.
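
As a rough illustration of the importance-sampling correction mentioned above,
the sketch below computes per-step ratios between an evaluation policy and an
estimated behavior policy and uses them to weight a supervised action loss.
This is a generic sketch, not the DPE algorithm itself: the abstract does not
specify the estimator, so the diagonal-Gaussian policy form, the clipping
threshold, and every function name here are assumptions made for illustration.

```python
import numpy as np

def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions.

    Assumption: both policies are modeled as diagonal Gaussians so that a
    density ratio exists (a deterministic policy has no density).
    """
    var = std ** 2
    per_dim = -0.5 * ((action - mean) ** 2 / var + np.log(2.0 * np.pi * var))
    return np.sum(per_dim, axis=-1)

def importance_weights(logp_eval, logp_behavior, clip=10.0):
    """Per-step ratio pi_e(a|s) / pi_b(a|s), clipped from above to limit variance."""
    ratio = np.exp(logp_eval - logp_behavior)
    return np.clip(ratio, 0.0, clip)

def weighted_mse(pred_actions, data_actions, weights):
    """Self-normalized, importance-weighted mean squared error over steps."""
    per_step = np.mean((pred_actions - data_actions) ** 2, axis=-1)
    return float(np.sum(weights * per_step) / np.sum(weights))

# Hypothetical batch: 2 steps, 1-dimensional actions.
actions = np.array([[0.0], [1.0]])
eval_mean = np.zeros((2, 1))       # assumed evaluation-policy means
behavior_mean = np.zeros((2, 1))   # assumed estimated behavior-policy means
std = np.ones((2, 1))

logp_e = gaussian_log_prob(actions, eval_mean, std)
logp_b = gaussian_log_prob(actions, behavior_mean, std)
weights = importance_weights(logp_e, logp_b)
loss = weighted_mse(eval_mean, actions, weights)
```

When the two policies coincide, as in this toy batch, every weight equals 1 and
the loss reduces to the ordinary mean squared error. Clipping the ratio is one
common way to trade a small bias for lower variance.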

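This work proposes a reinforcement learning algorithm that uses Double Policy Estimation (DPE) to combine offline sequence modeling with offline reinforcement learning and has statistically proven variance-reduction properties. Applied to multiple OpenAI Gym tasks, it achieves performance improvements on the D4RL benchmarks and outperforms baseline methods, demonstrating the advantages of double policy estimation in sequence-modeled reinforcement learning.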