Discrete-action reinforcement learning algorithms often falter in tasks with
high-dimensional discrete action spaces due to the vast number of possible
actions. A recent advancement leverages value-decomposition, a concept from
multi-agent reinforcement learning, to tackle this challenge. This study
analyses the effects of this value-decomposition, revealing that whilst it
curtails the over-estimation bias inherent to Q-learning algorithms, it
amplifies target variance. To counteract this, we propose an ensemble of
critics. Moreover, we introduce a regularisation
loss that helps to mitigate the effects that exploratory actions in one
dimension can have on the value of optimal actions in other dimensions. Our
novel algorithm, REValueD, achieves superior performance on discretised
versions of the DeepMind Control Suite tasks, especially in the
challenging humanoid and dog tasks. We further dissect the factors influencing
REValueD's performance, evaluating the significance of the regularisation loss
and the scalability of REValueD with increasing sub-actions per dimension.
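
To make the two ideas above concrete, the following is a minimal sketch, not the REValueD implementation. It assumes that the decomposed Q-value is the mean of per-dimension utilities and that the bootstrap target averages over the ensemble of critics; the abstract states neither detail. The function names and array shapes are hypothetical, and the regularisation loss is not shown.

```python
import numpy as np


def greedy_action(utilities):
    """Pick the best sub-action independently in each action dimension.

    utilities: array of shape (num_dims, num_sub_actions).
    """
    return utilities.argmax(axis=1)


def decomposed_q(utilities, action):
    """Assumed decomposition: Q(s, a) is the mean of per-dimension utilities."""
    dims = np.arange(utilities.shape[0])
    return utilities[dims, action].mean()


def ensemble_target(reward, gamma, next_utilities):
    """Assumed target: average the decomposed greedy value over all critics.

    next_utilities: array of shape (num_critics, num_dims, num_sub_actions).
    """
    # Greedy value per critic: max over sub-actions, then mean over dimensions.
    per_critic = next_utilities.max(axis=2).mean(axis=1)
    # Averaging over critics is the step intended to reduce target variance.
    return reward + gamma * per_critic.mean()


# Toy example: 2 action dimensions, 3 sub-actions each, 2 critics.
utilities = np.array([[1.0, 3.0, 2.0],
                      [0.0, -1.0, 4.0]])
action = greedy_action(utilities)
q_value = decomposed_q(utilities, action)
target = ensemble_target(1.0, 0.5, np.stack([utilities, utilities + 2.0]))
```

Because the maximisation is done per dimension, the cost of selecting a greedy action grows with the sum of the sub-action counts rather than their product, which is what makes the decomposition attractive in high-dimensional discrete action spaces.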
