Chaos-based reinforcement learning (CBRL) is a method in which the agent's
internal chaotic dynamics drive exploration. This approach offers a model for
considering how the biological brain can create variability in its behavior and
learn in an exploratory manner. At the same time, it is a learning model that
can automatically switch between exploration and exploitation modes and has the
potential to realize higher-level exploration that reflects what it has learned
so far. However, the learning algorithms used in CBRL have not been well
established in previous studies and have yet to incorporate recent advances in
reinforcement learning. This study introduces Twin Delayed Deep Deterministic
Policy Gradients (TD3), a state-of-the-art deep reinforcement learning algorithm
that handles deterministic policies over continuous action spaces, into CBRL.
The validation results provide several insights. First,
TD3 works as a learning algorithm for CBRL in a simple goal-reaching task.
Second, CBRL agents with TD3 can autonomously suppress their exploratory
behavior as learning progresses and resume exploration when the environment
changes. Finally, examining the effect of the agent's chaoticity on learning
shows that extremely strong chaos negatively impacts the flexible switching
between exploration and exploitation.
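
The notion of an agent's "chaoticity" can be illustrated with a minimal sketch. The random recurrent network below, its gain parameter, and the name `divergence` are assumptions made for illustration and are not the architecture described in the abstract. The sketch measures how far two trajectories that start almost identically drift apart: at low gain the internal dynamics contract and produce no variability, while at high gain tiny differences are amplified, which is the kind of internally generated variability that could drive exploration without external noise.

```python
import numpy as np

def divergence(gain, n_units=200, n_steps=200, eps=1e-8, seed=0):
    """Distance between two trajectories of x <- tanh(gain * W x) started eps apart.

    Hypothetical illustration: `gain` scales the recurrent weights and acts
    as a knob for how chaotic the internal dynamics are.
    """
    rng = np.random.default_rng(seed)
    # Random recurrent weights with variance 1/n_units.
    W = rng.normal(0.0, 1.0 / np.sqrt(n_units), size=(n_units, n_units))
    x = rng.uniform(-1.0, 1.0, size=n_units)
    y = x.copy()
    y[0] += eps  # perturb a single unit by a tiny amount
    for _ in range(n_steps):
        x = np.tanh(gain * W @ x)
        y = np.tanh(gain * W @ y)
    return float(np.linalg.norm(x - y))

if __name__ == "__main__":
    for g in (0.5, 3.0):
        print(f"gain={g}: final distance={divergence(g):.3e}")
```

In this toy setting the gain plays the role of the chaoticity parameter: a learning rule would have to be able to pull the dynamics back toward the ordered regime for exploration to subside.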

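Chaos-based reinforcement learning (CBRL) is a method that drives exploration through internal chaotic dynamics. This study introduces Twin Delayed Deep Deterministic Policy Gradients (TD3), one of the latest deep reinforcement learning algorithms, into CBRL and validates it. TD3 works as a learning algorithm in a simple goal-reaching task; CBRL agents can autonomously suppress exploratory behavior as learning progresses and resume exploration when the environment changes; and strong chaoticity is found to negatively affect flexible switching between exploration and exploitation.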

Chaos-based deep reinforcement learning with the TD3 algorithm

Chaos-based reinforcement learning with TD3

Deep Reinforcement Learning (DRL) has made tremendous advances in both
simulated and real-world robot control tasks in recent years. Nevertheless,
applying DRL to novel robot control tasks is still challenging, especially when
researchers have to design the action and observation space and the reward
function. In this paper, we investigate partial observability as a potential
source of failure when applying DRL to robot control tasks, which can arise when
researchers are unsure whether the observation space fully represents the
underlying state. We compare the performance of three common DRL algorithms,
TD3, SAC, and PPO, under various partial observability conditions. We find that
TD3 and SAC easily become stuck in local optima and underperform PPO. Because
vanilla TD3 and SAC rely on one-step bootstrapping, we propose multi-step
versions of both to improve their robustness to partial observability.
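
The difference between one-step and multi-step bootstrapping can be sketched as follows. This is the standard n-step return, not necessarily the exact variant used in the paper; the function name `n_step_target` and its parameters are assumptions for illustration. With a single reward the function reduces to the one-step target r + gamma * Q(s', a').

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99, dones=None):
    """Return sum_k gamma^k * r_k + gamma^n * bootstrap_value.

    rewards: the n rewards observed along the stored trajectory segment.
    bootstrap_value: critic estimate at the state reached after n steps
        (an assumed input; how it is computed depends on the algorithm).
    dones: optional per-step terminal flags; bootstrapping is skipped if
        the episode ended inside the segment.
    """
    target = 0.0
    discount = 1.0
    for k, r in enumerate(rewards):
        target += discount * r
        discount *= gamma
        if dones is not None and dones[k]:
            return target  # terminal state: nothing left to bootstrap from
    return target + discount * bootstrap_value

if __name__ == "__main__":
    print(n_step_target([1.0], 10.0, gamma=0.5))            # one-step target
    print(n_step_target([1.0, 1.0, 1.0], 8.0, gamma=0.5))   # three-step target
```

A longer segment relies more on observed rewards and less on the critic's estimate, which is one plausible reason multi-step targets would help when the critic's input does not fully capture the state.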

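This paper studies the application of Deep Reinforcement Learning to robot control tasks, particularly under partial observability. It compares the performance of TD3, SAC, and PPO, and proposes multi-step versions of TD3 and SAC that improve their robustness to partial observability.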

Partial observability in the DRL process for robot control

Partial Observability during DRL for Robot Control

A promising characteristic of Deep Reinforcement Learning (DRL) is its
capability to learn an optimal policy in an end-to-end manner without relying on
feature engineering. However, most approaches assume a fully observable state
space, i.e., a fully observable Markov Decision Process (MDP). In real-world
robotics, this assumption is impractical because of issues such as limited
sensor sensitivity, sensor noise, and uncertainty about whether the observation
design is complete. These scenarios lead to Partially Observable MDPs (POMDPs).
In this paper, we propose
Long-Short-Term-Memory-based Twin Delayed Deep Deterministic Policy Gradient
(LSTM-TD3) by introducing a memory component to TD3, and compare its
performance with other DRL algorithms in both MDPs and POMDPs. Our results
demonstrate the significant advantages of the memory component in addressing
POMDPs, including the ability to handle missing and noisy observation data.
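
The setting can be made concrete with a small sketch of how missing and noisy observations might be simulated, and of the observation history that a memory component would consume. The names `corrupt` and `History`, the noise level, and the drop probability are all assumptions for illustration; the abstract does not specify how the POMDPs were constructed or how the LSTM is fed.

```python
from collections import deque

import numpy as np

def corrupt(obs, rng, noise_std=0.1, drop_prob=0.2):
    """Add Gaussian noise, then zero each entry with probability drop_prob."""
    obs = np.asarray(obs, dtype=float)
    noisy = obs + rng.normal(0.0, noise_std, size=obs.shape)
    keep = rng.random(obs.shape) >= drop_prob  # False marks a missing reading
    return noisy * keep

class History:
    """Fixed-length window of past observations, zero-padded at the start."""

    def __init__(self, obs_dim, length):
        self.buf = deque([np.zeros(obs_dim)] * length, maxlen=length)

    def push(self, obs):
        """Append one observation and return the window, oldest first."""
        self.buf.append(np.asarray(obs, dtype=float))
        return np.stack(self.buf)  # shape: (length, obs_dim)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    history = History(obs_dim=3, length=4)
    for t in range(5):
        window = history.push(corrupt(np.full(3, float(t)), rng))
    print(window.shape)
```

A recurrent actor or critic would take the stacked window (or process observations one at a time while carrying hidden state) instead of the single most recent observation.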

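This paper introduces LSTM-TD3, a method that adds a memory component to TD3 to cope with partially observable MDPs. Compared with other DRL algorithms, it shows significant advantages under partial observability, including the ability to handle missing and noisy observation data.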

Application of memory-based deep reinforcement learning to POMDPs

Memory-based Deep Reinforcement Learning for POMDPs

We present a mean-variance policy iteration (MVPI) framework for risk-averse
control in a discounted infinite-horizon MDP optimizing the variance of a
per-step reward random variable. MVPI enjoys great flexibility in that any
policy evaluation method and risk-neutral control method can be dropped in for
risk-averse control off the shelf, in both on- and off-policy settings. This
flexibility reduces the gap between risk-neutral control and risk-averse
control and is achieved by working on a novel augmented MDP directly. We
propose risk-averse TD3 as an example instantiating MVPI, which outperforms
vanilla TD3 and many previous risk-averse control methods in challenging MuJoCo
robot simulation tasks under a risk-aware performance metric. This risk-averse
TD3 is the first to introduce deterministic policies and off-policy learning
into risk-averse reinforcement learning, both of which are key to the
performance boost we show in MuJoCo domains.
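
One way to see how a risk-neutral method can be reused for a mean-variance objective is the identity E[R]^2 = max_y (2 y E[R] - y^2), which turns E[R] - lambda * Var(R) into the mean of a transformed reward minus lambda * y^2, with the maximum attained at y = E[R]. The sketch below follows from that identity and is an assumption about the form of the augmented reward; the paper's exact construction may differ, and the names `augmented_rewards` and `risk_weight` are hypothetical.

```python
import numpy as np

def augmented_rewards(rewards, risk_weight):
    """Transform rewards as r - lam * r^2 + 2 * lam * r * y, with y = mean(r).

    Assumed form, derived from E[R]^2 = max_y (2 * y * E[R] - y^2).
    """
    r = np.asarray(rewards, dtype=float)
    y = r.mean()  # maximizer of the inner problem for the current data
    return r - risk_weight * r**2 + 2.0 * risk_weight * r * y

def mean_variance_objective(rewards, risk_weight):
    """Per-step mean-variance objective: mean(r) - lam * var(r)."""
    r = np.asarray(rewards, dtype=float)
    return r.mean() - risk_weight * r.var()

if __name__ == "__main__":
    r = np.array([1.0, 2.0, 3.0, 4.0])
    lam = 0.5
    lhs = augmented_rewards(r, lam).mean() - lam * r.mean() ** 2
    print(lhs, mean_variance_objective(r, lam))  # the two values agree
```

Under this reading, a risk-neutral learner such as TD3 would be trained on the transformed rewards while y is periodically re-estimated from the current policy's rewards.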

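This study proposes a risk-averse control method based on the mean-variance policy iteration (MVPI) framework. It can use any policy evaluation method and any risk-neutral control method, and it narrows the gap between risk-neutral and risk-averse control by working directly on a novel augmented MDP. Risk-averse TD3 is presented as an example instantiation of MVPI; it outperforms vanilla TD3 and other risk-averse control methods on MuJoCo robot simulation tasks.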