The combination of deep reinforcement learning (DRL) with ensemble methods
has proven highly effective in addressing complex sequential
decision-making problems. This success can be primarily attributed to the
use of multiple models, which enhances both the robustness of the
policy and the accuracy of value function estimation. However, there has been
limited analysis of the empirical success of current ensemble RL methods thus
far. Our new analysis reveals that the sample efficiency of previous ensemble
DRL algorithms may be limited by sub-policies that are not as diverse as they
could be. Motivated by these findings, our study introduces a new ensemble RL
algorithm, termed \textbf{T}rajectories-awar\textbf{E} \textbf{E}nsemble
exploratio\textbf{N} (TEEN). The primary goal of TEEN is to maximize the
expected return while promoting more diverse trajectories. Through extensive
experiments, we demonstrate that TEEN not only enhances the sample diversity of
the ensemble policy compared to using sub-policies alone but also improves
performance over prior ensemble RL algorithms. On average, TEEN outperforms the
baseline ensemble DRL algorithms by 41\% on the tested
representative environments.
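
The idea of rewarding diverse trajectories alongside return can be illustrated with a small sketch. It is not the TEEN objective itself: the abstract does not specify how diversity is measured, so the Jensen-Shannon divergence between sub-policy state-visitation histograms, the helper names, and the weight `alpha` below are all assumptions chosen for illustration.

```python
import numpy as np


def visitation_histogram(states, bins, low, high):
    """Normalized histogram of the 1-D states visited along a trajectory."""
    counts, _ = np.histogram(states, bins=bins, range=(low, high))
    return counts / counts.sum()


def entropy(p):
    """Shannon entropy in nats, skipping empty bins."""
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())


def ensemble_diversity(histograms):
    """Jensen-Shannon divergence across sub-policy visitation histograms.

    Equals the entropy of the mixture minus the mean entropy of the members;
    it is 0 when all sub-policies visit the same states.
    (Assumed diversity measure, not taken from the abstract.)
    """
    mixture = np.mean(histograms, axis=0)
    member_entropy = np.mean([entropy(h) for h in histograms])
    return entropy(mixture) - float(member_entropy)


def regularized_objective(returns, histograms, alpha=0.1):
    """Mean return plus a diversity bonus weighted by a hypothetical alpha."""
    return float(np.mean(returns)) + alpha * ensemble_diversity(histograms)


# Two toy sub-policies whose trajectories cover disjoint halves of [0, 1].
hist_a = visitation_histogram(np.array([0.1, 0.2, 0.3, 0.4]), 2, 0.0, 1.0)
hist_b = visitation_histogram(np.array([0.6, 0.7, 0.8, 0.9]), 2, 0.0, 1.0)

identical = ensemble_diversity([hist_a, hist_a])  # no diversity
disjoint = ensemble_diversity([hist_a, hist_b])   # maximal for two members
score = regularized_objective([10.0, 12.0], [hist_a, hist_b], alpha=0.5)
```

Under this measure, two sub-policies that visit the same states contribute no bonus, while sub-policies that cover different regions raise the objective even when their returns are equal.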

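Combining deep reinforcement learning with ensemble methods, we propose a new ensemble reinforcement learning algorithm, TEEN. Experiments show that TEEN increases the sample diversity of the ensemble policy compared with using sub-policies alone and achieves better performance. On average, TEEN outperforms the baseline ensemble reinforcement learning algorithms by 41% on the tested representative environments.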

Keep Various Trajectories: Promoting Exploration of Ensemble Policies in Continuous Control

In classic reinforcement learning algorithms, agents make decisions at
discrete and fixed time intervals. The physical duration between one decision
and the next becomes a critical hyperparameter. When this duration is too
short, the agent needs to make many decisions to achieve its goal, which makes
the problem harder. But when this duration is too long, the agent becomes
incapable of controlling the system. Physical systems, however, do not need a
constant control frequency. For learning agents, it is desirable to operate
with low frequency when possible and high frequency when necessary. We propose
a framework called Continuous-Time Continuous-Options (CTCO), where the agent
chooses options as sub-policies of variable durations. Such options are
time-continuous and can interact with the system at any desired frequency,
providing a smooth change of actions. Our empirical analysis shows that the
algorithm is competitive with other time-abstraction techniques, such as
classic option learning and action repetition, and practically overcomes the
difficult choice of the decision frequency.
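
A minimal sketch can make the notion of a time-continuous option with variable duration concrete. The abstract does not say how CTCO parameterizes its options, so the linear interpolation between a start and an end action, the `Option` class, and the `rollout` helper below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """Hypothetical time-continuous option: a smooth action curve plus a duration."""
    start_action: float
    end_action: float
    duration: float  # seconds the option stays active

    def action_at(self, t: float) -> float:
        """Action t seconds after the option started (t is clamped to the duration)."""
        frac = min(max(t / self.duration, 0.0), 1.0)
        return self.start_action + frac * (self.end_action - self.start_action)


def rollout(options, dt):
    """Run options back to back, querying the active option every dt seconds."""
    actions = []
    for option in options:
        steps = max(1, round(option.duration / dt))
        for k in range(steps):
            actions.append(option.action_at(k * dt))
    return actions


# Two decisions: a slow 2-second ramp, then a short 0.5-second hold.
options = [Option(0.0, 1.0, 2.0), Option(1.0, 1.0, 0.5)]

coarse = rollout(options, dt=0.5)  # low control frequency
fine = rollout(options, dt=0.1)    # high control frequency, same two decisions
```

In this sketch the agent makes only two decisions regardless of `dt`; the control frequency changes how often the chosen curves are sampled, not how often a decision is required.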

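This study proposes a framework called CTCO, in which the learning agent selects sub-policies of variable duration so that it operates at low frequency when possible and at high frequency when necessary, thereby overcoming the difficulty of choosing the decision frequency.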

Variable-Decision Frequency Option Critic

Adopting reasonable strategies is challenging but crucial for an intelligent
agent with limited resources working in hazardous, unstructured, and dynamic
environments to improve the system utility, decrease the overall cost, and
increase mission success probability. Deep Reinforcement Learning (DRL) helps
organize agents' behaviors and actions based on their state and represent
complex strategies (compositions of actions). This paper proposes a novel
hierarchical strategy decomposition approach based on Bayesian chaining to
separate an intricate policy into several simple sub-policies and organize
their relationships as Bayesian strategy networks (BSN). We integrate this
approach into the state-of-the-art DRL method, soft actor-critic (SAC), and
build the corresponding Bayesian soft actor-critic (BSAC) model by organizing
several sub-policies as a joint policy. We compare the proposed BSAC method
with the SAC and other state-of-the-art approaches such as TD3, DDPG, and PPO
on the standard continuous control benchmarks -- Hopper-v2, Walker2d-v2, and
Humanoid-v2 -- in MuJoCo with the OpenAI Gym environment. The results
demonstrate the promising potential of the BSAC method, which significantly
improves training efficiency. The open-source code for BSAC can be accessed
at this https URL.
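
The decomposition of a joint policy into chained sub-policies can be sketched with the chain rule of probability, pi(a1, ..., an | s) = pi_1(a1 | s) * pi_2(a2 | s, a1) * ... This is not the authors' implementation: the full chain is only one possible network structure, and the linear-Gaussian sub-policies and the class names below are hypothetical stand-ins for the learned networks.

```python
import numpy as np


class GaussianSubPolicy:
    """Linear-Gaussian sub-policy over one action component (illustrative only)."""

    def __init__(self, input_dim, rng, std=0.5):
        self.weights = rng.normal(scale=0.1, size=input_dim)
        self.std = std

    def mean(self, inputs):
        return float(self.weights @ inputs)

    def sample(self, inputs, rng):
        return float(rng.normal(self.mean(inputs), self.std))

    def log_prob(self, inputs, action):
        z = (action - self.mean(inputs)) / self.std
        return float(-0.5 * z**2 - np.log(self.std) - 0.5 * np.log(2 * np.pi))


class ChainedJointPolicy:
    """Sub-policy i conditions on the state and on the actions of sub-policies 0..i-1."""

    def __init__(self, state_dim, num_sub_policies, rng):
        self.sub_policies = [
            GaussianSubPolicy(state_dim + i, rng) for i in range(num_sub_policies)
        ]

    def sample(self, state, rng):
        """Sample each component in chain order; return the joint action and its log-probability."""
        actions, total_log_prob = [], 0.0
        for sub in self.sub_policies:
            inputs = np.concatenate([state, np.array(actions, dtype=float)])
            action = sub.sample(inputs, rng)
            total_log_prob += sub.log_prob(inputs, action)
            actions.append(action)
        return np.array(actions), total_log_prob

    def log_prob(self, state, actions):
        """Joint log-probability as the sum of the sub-policy log-probabilities."""
        total = 0.0
        for i, sub in enumerate(self.sub_policies):
            inputs = np.concatenate([state, actions[:i]])
            total += sub.log_prob(inputs, actions[i])
        return total


rng = np.random.default_rng(0)
policy = ChainedJointPolicy(state_dim=4, num_sub_policies=3, rng=rng)
state = np.ones(4)
joint_action, joint_log_prob = policy.sample(state, rng)
```

Because the joint log-probability is a plain sum over sub-policies, an entropy-regularized method such as SAC could in principle use it wherever it needs the log-probability of the full action.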

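This paper proposes a novel hierarchical strategy decomposition approach based on Bayesian chaining that decomposes a policy into several simple sub-policies and organizes their relationships as a Bayesian strategy network. The approach is integrated into a state-of-the-art deep reinforcement learning method, soft actor-critic (SAC), to build the corresponding Bayesian soft actor-critic (BSAC) model. By organizing multiple sub-policies as a joint policy, the method achieves good performance and significantly improves training efficiency.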