DDPG is hindered by the overestimation bias problem, wherein its
$Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to
this bias involve ensemble-based methods, which require significant
computational resources, or complex log-policy-based approaches, which are
difficult to understand and implement. In contrast, we propose a
straightforward solution: a $Q$-target that incorporates a behavioral
cloning (BC) loss penalty. The penalty acts as an uncertainty measure and can
be implemented with minimal code and without an ensemble.
Empirically, the resulting method, Conservative DDPG, outperforms DDPG on
every MuJoCo and Bullet task we evaluated. It is also competitive with or
superior to TD3 and TD7, while requiring significantly less computation.
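
To make the idea of a BC-penalized $Q$-target concrete, the following is a minimal sketch, not the exact formulation used in this work. It assumes the penalty is the mean squared error between the policy's action and the action stored in the replay buffer, and that it is subtracted from the bootstrapped target $Q$-value with a weight `alpha`. The function names, the weight `alpha`, and the placement of the penalty inside the discounted term are all illustrative assumptions.

```python
from typing import Sequence


def bc_penalty(policy_action: Sequence[float], buffer_action: Sequence[float]) -> float:
    """Mean squared error between the policy's action and the stored action.

    Assumption: this MSE form stands in for the behavioral cloning loss.
    """
    if len(policy_action) != len(buffer_action):
        raise ValueError("action vectors must have the same length")
    squared_errors = [(p - b) ** 2 for p, b in zip(policy_action, buffer_action)]
    return sum(squared_errors) / len(squared_errors)


def conservative_q_target(
    reward: float,
    next_q: float,
    policy_action: Sequence[float],
    buffer_action: Sequence[float],
    gamma: float = 0.99,
    alpha: float = 1.0,
    done: bool = False,
) -> float:
    """One-step TD target with a BC penalty subtracted from the bootstrap value.

    Assumed form: y = r + gamma * (Q_target - alpha * BC_loss).
    A larger disagreement between the policy and the data lowers the target,
    which is how the penalty plays the role of an uncertainty measure here.
    """
    if done:
        # Terminal transitions do not bootstrap from the next state.
        return reward
    penalty = bc_penalty(policy_action, buffer_action)
    return reward + gamma * (next_q - alpha * penalty)
```

With `alpha = 0` this sketch reduces to the ordinary one-step target used by DDPG, so the penalty weight controls how pessimistic the target is. No ensemble of critics is involved: the only extra cost over DDPG in this sketch is one squared-error computation per transition.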