In many real-world settings, agents must learn from an offline dataset
gathered by some prior behavior policy. Such a setting naturally leads to
distribution shift between the behavior policy and the target policy being
trained, which requires policy conservatism to avoid instability and overestimation
bias. Autoregressive world models offer a different solution to this problem by
generating synthetic, on-policy experience. However, in practice, model
rollouts must be severely truncated to avoid compounding error. As an
alternative, we propose policy-guided diffusion. Our method uses diffusion
models to generate entire trajectories under the behavior distribution,
applying guidance from the target policy to move synthetic experience further
on-policy. We show that policy-guided diffusion models a regularized form of
the target distribution that balances action likelihood under both the target
and behavior policies, leading to plausible trajectories with high target
policy probability, while retaining a lower dynamics error than an offline
world model baseline. Using synthetic experience from policy-guided diffusion
as a drop-in substitute for real data, we demonstrate significant improvements
in performance across a range of standard offline reinforcement learning
algorithms and environments. Our approach provides an effective alternative to
autoregressive offline world models, opening the door to the controllable
generation of synthetic training data.

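We propose a method that uses diffusion models to generate entire trajectories
under the behavior distribution and applies guidance from the target policy to
move the synthetic experience closer to the target policy. Used in place of
real data for offline reinforcement learning, this synthetic experience yields
significant performance improvements across a range of standard offline
reinforcement learning algorithms and environments.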
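
The guidance idea described above can be illustrated with a toy,
one-dimensional sketch. This is not the trajectory diffusion model itself: it
assumes a single action, Gaussian behavior and target-policy densities, and a
regularized distribution proportional to the behavior density times the
target-policy density raised to a guidance coefficient lam. The function names,
the coefficient lam, and the use of plain Langevin dynamics in place of a
learned denoising process are all assumptions made for illustration.

```python
import random


def guided_score(a, mu_b, sigma_b, mu_t, sigma_t, lam):
    """Score (d/da log density) of the assumed regularized distribution
    p_lam(a), proportional to N(a; mu_b, sigma_b^2) * N(a; mu_t, sigma_t^2)^lam."""
    behavior_score = -(a - mu_b) / sigma_b**2  # gradient of log behavior density
    target_score = -(a - mu_t) / sigma_t**2    # gradient of log target-policy density
    # Guidance: add the target-policy gradient, scaled by lam, to the behavior score.
    return behavior_score + lam * target_score


def regularized_mean(mu_b, sigma_b, mu_t, sigma_t, lam):
    """Closed-form mean of p_lam, which is Gaussian in this toy setting."""
    precision = 1.0 / sigma_b**2 + lam / sigma_t**2
    return (mu_b / sigma_b**2 + lam * mu_t / sigma_t**2) / precision


def langevin_sample(mu_b, sigma_b, mu_t, sigma_t, lam,
                    steps=100, step_size=0.05, rng=random):
    """Draw one approximate sample from p_lam with unadjusted Langevin dynamics."""
    a = rng.gauss(mu_b, sigma_b)  # start from a behavior-distribution sample
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        score = guided_score(a, mu_b, sigma_b, mu_t, sigma_t, lam)
        a += step_size * score + (2.0 * step_size) ** 0.5 * noise
    return a


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [langevin_sample(0.0, 1.0, 2.0, 0.5, lam=1.0, rng=rng)
               for _ in range(500)]
    print("sample mean:", sum(samples) / len(samples))
    print("closed-form mean:", regularized_mean(0.0, 1.0, 2.0, 0.5, 1.0))
```

With lam = 0 the samples follow the behavior distribution unchanged. As lam
grows, they concentrate around actions the target policy prefers while staying
anchored to regions the behavior data supports, which is the balance between
the two policies that the abstract describes.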