Reinforcement learning (RL) has proven highly effective in addressing complex
decision-making and control tasks. However, most traditional RL algorithms
parameterize the policy as a diagonal Gaussian distribution with learned mean
and variance, which limits their ability to represent complex policies. To
address this problem, we propose an online RL algorithm termed
diffusion actor-critic with entropy regulator (DACER). This algorithm
conceptualizes the reverse process of the diffusion model as a novel policy
function and leverages the capability of the diffusion model to fit multimodal
distributions, thereby enhancing the representational capacity of the policy.
Since the distribution of the diffusion policy lacks an analytical expression,
its entropy cannot be determined analytically. To address this, we propose a
method that estimates the entropy of the diffusion policy using a Gaussian
mixture model (GMM). Building on the estimated entropy, we learn a parameter
$\alpha$ that balances exploration and exploitation. This parameter adaptively
regulates the variance of the noise added to the action output by the
diffusion model.
Experiments on MuJoCo benchmarks and a multimodal task demonstrate that DACER
achieves state-of-the-art (SOTA) performance on most MuJoCo control tasks and
that its diffusion policy has stronger representational capacity.
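
The entropy-regulation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one-dimensional actions, fits the GMM with plain EM, estimates the entropy as the negative mean log-density of the sampled actions, and updates $\alpha$ with a SAC-style rule. The names `fit_gmm_1d`, `gmm_entropy`, `update_alpha`, `noisy_action`, the target entropy, the learning rate, and the noise scale `lam` are all hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, gmm):
    """Density of a 1-D Gaussian mixture given as (weights, means, variances)."""
    weights, means, variances = gmm
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def fit_gmm_1d(xs, k=2, iters=50):
    """Fit a k-component 1-D GMM to the samples xs with plain EM."""
    n = len(xs)
    ordered = sorted(xs)
    # Initialise the means at evenly spaced quantiles, with a shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(xs) / n
    var0 = sum((x - centre) ** 2 for x in xs) / n + 1e-6
    weights, variances = [1.0 / k] * k, [var0] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) + 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
    return weights, means, variances


def gmm_entropy(xs, gmm):
    """Monte Carlo entropy estimate: negative mean log-density of the samples."""
    return -sum(math.log(gmm_pdf(x, gmm) + 1e-300) for x in xs) / len(xs)


def update_alpha(alpha, entropy_est, target_entropy, lr):
    """SAC-style rule (assumed): raise alpha when entropy is below the target."""
    return max(alpha - lr * (entropy_est - target_entropy), 0.0)


def noisy_action(action, alpha, lam, rng):
    """Add Gaussian noise whose standard deviation is lam * alpha (assumed form)."""
    return action + lam * alpha * rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for actions sampled from a bimodal diffusion policy at one state.
    actions = [rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 0.3) for _ in range(2000)]
    gmm = fit_gmm_1d(actions, k=2)
    entropy = gmm_entropy(actions, gmm)
    alpha = update_alpha(alpha=1.0, entropy_est=entropy, target_entropy=1.5, lr=0.1)
    print("entropy estimate:", entropy)
    print("updated alpha:", alpha)
    print("noisy action:", noisy_action(actions[0], alpha, lam=0.15, rng=rng))
```

In this sketch, an entropy estimate below the target increases $\alpha$, which enlarges the noise added to the action and so pushes the policy toward more exploration. An estimate above the target does the opposite.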

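Proposes DACER, an online reinforcement learning algorithm that uses the ability of diffusion models to fit multimodal distributions to strengthen the policy's representational capacity, and introduces a method for estimating the entropy of the diffusion policy; experiments on MuJoCo benchmarks and a multimodal task demonstrate state-of-the-art performance.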

Diffusion Actor-Critic with Entropy Regulator

We propose a new policy parameterization for representing 3D rotations during
reinforcement learning. In the continuous control reinforcement learning
literature, many stochastic policy parameterizations are Gaussian. We argue
that universally applying a Gaussian policy parameterization is not desirable
for all environments. One case in particular is tasks that involve predicting a
3D rotation output, either in isolation or coupled with translation as part of
a full 6D pose output. Our proposed Bingham
Policy Parameterization (BPP) models the Bingham distribution and allows for
better rotation (quaternion) prediction over a Gaussian policy parameterization
in a range of reinforcement learning tasks. We evaluate BPP on the rotation
Wahba problem task, as well as a set of vision-based next-best pose robot
manipulation tasks from RLBench. We hope that this paper encourages more
research into developing other policy parameterizations that are better suited
to particular environments, rather than always assuming a Gaussian.
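
One reason a Bingham distribution suits quaternion outputs is that its density is antipodally symmetric, so $q$ and $-q$, which encode the same rotation, receive the same probability. The sketch below illustrates this with the standard unnormalized Bingham log-density $\sum_i z_i (m_i^\top q)^2$. It is not the paper's policy network: the orthonormal directions `M`, the concentrations `z`, and all function names are hypothetical values chosen for illustration, and the normalizing constant is omitted.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def normalize(q):
    """Scale a 4-vector to unit length so it is a valid quaternion."""
    norm = math.sqrt(dot(q, q))
    return tuple(x / norm for x in q)


def bingham_log_density_unnorm(q, M, z):
    """Unnormalized Bingham log-density: sum_i z_i * (m_i . q)^2.

    M is a list of four orthonormal 4-vectors and z holds the matching
    concentrations, all <= 0 with the last one fixed at 0, so the mode
    lies along M[3].
    """
    return sum(zi * dot(mi, q) ** 2 for zi, mi in zip(z, M))


def gaussian_log_density_unnorm(q, mean, sigma):
    """Unnormalized isotropic Gaussian log-density on the raw 4-vector."""
    diff = [x - m for x, m in zip(q, mean)]
    return -0.5 * dot(diff, diff) / sigma ** 2


# Hypothetical parameters: mode at the identity quaternion (w, x, y, z) = (1, 0, 0, 0).
M = [(0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0)]
z = [-50.0, -50.0, -50.0, 0.0]
mode = M[3]

q = normalize((0.9, 0.1, 0.2, 0.3))
neg_q = tuple(-x for x in q)

# The Bingham density treats q and -q identically; the Gaussian does not.
print("Bingham :", bingham_log_density_unnorm(q, M, z), bingham_log_density_unnorm(neg_q, M, z))
print("Gaussian:", gaussian_log_density_unnorm(q, mode, 0.2), gaussian_log_density_unnorm(neg_q, mode, 0.2))
```

A Gaussian centred on one quaternion assigns low density to its negation even though both describe the same rotation. That mismatch is the kind of issue a rotation-specific parameterization is meant to avoid.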

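Proposes a new policy parameterization, Bingham Policy Parameterization (BPP), which models the Bingham distribution and thereby achieves better rotation (quaternion) prediction than a Gaussian policy parameterization across a range of reinforcement learning tasks.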