Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially in policy modeling in continuous action spaces. However, existing works exhibit significant variations in training schemes and RL optimization objectives, and some methods are only applicable to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.
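As a rough illustration of the advantage-weighted regression idea that GMPO builds on, the sketch below reweights per-sample generative training losses by exponentiated advantages. It assumes the per-sample losses (for example, a score-matching or flow-matching loss) and the advantage estimates have already been computed. The names `awr_weights` and `weighted_loss` and the temperature `beta` are hypothetical and are not part of the GenerativeRL API. Normalizing the weights over the batch is a simplification chosen for this example, not necessarily the paper's formulation.

```python
import math


def awr_weights(advantages, beta=1.0):
    """Return batch-normalized weights proportional to exp(A / beta).

    `beta` is an assumed temperature: smaller values concentrate the
    weight on the highest-advantage samples.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the maximum advantage before exponentiating so exp() cannot overflow.
    max_adv = max(advantages)
    unnormalized = [math.exp((a - max_adv) / beta) for a in advantages]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


def weighted_loss(per_sample_losses, advantages, beta=1.0):
    """Combine per-sample generative losses using advantage-based weights."""
    weights = awr_weights(advantages, beta)
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))
```

With equal advantages this reduces to the plain mean of the generative losses, i.e. ordinary behavior cloning. As the advantages spread out, training focuses on the actions the value estimate prefers.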

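This study addresses the inconsistent training schemes and optimization objectives in current applications of generative models to reinforcement learning, particularly in policy modeling, and proposes improved methods. It introduces two new training objectives, Generative Model Policy Optimization (GMPO) and Generative Model Policy Gradient (GMPG), and validates them within a standardized experimental framework. The methods achieve state-of-the-art performance on a range of offline reinforcement learning datasets, and the study provides unified guidance for training and deploying generative policies.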

Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective