The varying significance of distinct primitive behaviors during the policy
learning process has been overlooked by prior model-free RL algorithms.
Leveraging this insight, we explore the causal relationship between different
action dimensions and rewards to evaluate the significance of various primitive
behaviors during training. We introduce a causality-aware entropy term that
effectively identifies and prioritizes actions with high potential impacts for
efficient exploration. Furthermore, to prevent excessive focus on specific
primitive behaviors, we analyze the gradient dormancy phenomenon and introduce
a dormancy-guided reset mechanism to further enhance the efficacy of our
method. Our proposed algorithm, ACE: Off-policy Actor-critic with
Causality-aware Entropy regularization, demonstrates a substantial performance
advantage across 29 diverse continuous control tasks spanning 7 domains
compared to model-free RL baselines, which underscores the effectiveness,
versatility, and sample efficiency of our approach. Benchmark results
and videos are available at this https URL
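
As a rough illustration of what a causality-aware entropy term could look like, the sketch below weights the per-dimension entropy of a diagonal Gaussian policy by a causal weight for each action dimension. This is one plausible reading of the abstract, not the paper's exact formulation: the function names, the diagonal Gaussian policy, and the `causal_weights` input (assumed to be non-negative scores for the influence of each action dimension on reward, computed elsewhere) are all assumptions made for illustration.

```python
import numpy as np


def gaussian_entropy_per_dim(log_std):
    """Differential entropy of each dimension of a diagonal Gaussian.

    For a 1-D Gaussian with standard deviation sigma the entropy is
    0.5 * log(2 * pi * e) + log(sigma).
    """
    log_std = np.asarray(log_std, dtype=float)
    return 0.5 * np.log(2.0 * np.pi * np.e) + log_std


def causality_weighted_entropy(log_std, causal_weights):
    """Weighted sum of per-dimension entropies (hypothetical helper).

    causal_weights[i] is an assumed, externally supplied score for how
    strongly action dimension i influences the reward. Dimensions with
    larger weights contribute more to the entropy bonus, so exploration
    is encouraged more strongly along them.
    """
    entropies = gaussian_entropy_per_dim(log_std)
    weights = np.asarray(causal_weights, dtype=float)
    if weights.shape != entropies.shape:
        raise ValueError("causal_weights must match the action dimension")
    return float(np.sum(weights * entropies))


if __name__ == "__main__":
    # Two action dimensions with unit standard deviation (log_std = 0).
    log_std = [0.0, 0.0]
    # Uniform weights recover the ordinary policy entropy.
    print(causality_weighted_entropy(log_std, [1.0, 1.0]))
    # Putting all weight on the first dimension ignores the second.
    print(causality_weighted_entropy(log_std, [1.0, 0.0]))
```

With uniform weights this reduces to the standard entropy bonus used in maximum-entropy actor-critic methods, so the weighting only changes behavior when the action dimensions differ in estimated causal influence.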

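We propose ACE, a causality-aware off-policy actor-critic algorithm. By introducing a causality-aware entropy term and a gradient-dormancy-guided reset mechanism, it achieves a significant performance advantage on continuous control tasks.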

ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization

Recent advances in real-world applications of reinforcement learning (RL)
have relied on the ability to accurately simulate systems at scale. However,
domains such as fluid dynamical systems exhibit complex dynamic phenomena that
are hard to simulate at high integration rates, limiting the direct application
of modern deep RL algorithms to often expensive or safety-critical hardware. In
this work, we introduce "Box o Flows", a novel benchtop experimental control
system for systematically evaluating RL algorithms in dynamic real-world
scenarios. We describe the key components of the Box o Flows, and through a
series of experiments demonstrate how state-of-the-art model-free RL algorithms
can synthesize a variety of complex behaviors via simple reward specifications.
Furthermore, we explore the role of offline RL in data-efficient hypothesis
testing by reusing past experiences. We believe that the insights gained from
this preliminary study and the availability of systems like the Box o Flows
support the way forward for developing systematic RL algorithms that can be
generally applied to complex, dynamical systems. Supplementary material and
videos of experiments are available at
this https URL

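Recent research on real-world applications of reinforcement learning (RL) has relied on the ability to accurately simulate systems at scale. However, domains such as fluid dynamical systems exhibit complex dynamic phenomena that are hard to simulate at high integration rates, which limits the direct application of modern deep RL algorithms to expensive or safety-critical hardware. In this work, we introduce "Box o Flows", a novel benchtop experimental control system for systematically evaluating RL algorithms in dynamic real-world scenarios. We describe the key components of the Box o Flows and, through a series of experiments, demonstrate how state-of-the-art model-free RL algorithms can synthesize a variety of complex behaviors from simple reward specifications. In addition, we explore the role of offline RL in data-efficient hypothesis testing by reusing past experiences. We believe that the insights gained from this preliminary study, together with the availability of systems like the Box o Flows, will support the development of systematic RL algorithms that can be generally applied to complex dynamical systems. Supplementary material and videos of experiments are available at: [URL]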