Recent developments in offline reinforcement learning have uncovered the
immense potential of diffusion modeling, which excels at representing
heterogeneous behavior policies. However, sampling from diffusion policies is
considerably slow because it necessitates tens to hundreds of iterative
inference steps for one action. To address this issue, we propose to extract an
efficient deterministic inference policy from critic models and pretrained
diffusion behavior models, leveraging the latter to directly regularize the
policy gradient with the behavior distribution's score function during
optimization. Our method enjoys powerful generative capabilities of diffusion
modeling while completely circumventing the computationally intensive and
time-consuming diffusion sampling scheme, both during training and evaluation.
Extensive results on D4RL tasks show that our method boosts action sampling
speed by more than 25 times compared with various leading diffusion-based
methods in locomotion tasks, while still maintaining state-of-the-art
performance.

我们提出了一种从评论家模型和预训练的扩散行为模型中有效地提取确定性推理策略的方法，利用后者在优化过程中直接规范化行为分布的评分函数，从而在训练和评估期间完全避免计算密集型和耗时的扩散采样方案，扩散建模的强大生成能力使我们的方法在 D4RL 任务上将行动采样速度提高了 25 倍以上，同时仍保持着最先进的性能。