In continual or lifelong reinforcement learning, access to the environment
should be limited. If we aspire to design algorithms that can run for
long periods of time, continually adapting to new, unexpected situations, then
we must be willing to deploy our agents without tuning their hyperparameters
over the agent's entire lifetime. The standard practice in deep RL -- and even
continual RL -- is to assume unfettered access to the deployment environment for
the full lifetime of the agent. This paper explores the notion that progress in
lifelong RL research has been held back by inappropriate empirical
methodologies. In this paper we propose a new approach for tuning and
evaluating lifelong RL agents where only one percent of the experiment data can
be used for hyperparameter tuning. We then conduct an empirical study of DQN
and Soft Actor Critic across a variety of continuing and non-stationary
domains. We find both methods generally perform poorly when restricted to
one-percent tuning, whereas several algorithmic mitigations designed to
maintain network plasticity perform surprisingly well. In addition, we find that
properties designed to measure the network's ability to learn continually
indeed correlate with performance under one-percent tuning.
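
The tuning protocol described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the one-percent budget is spent on short runs of length equal to 1% of the lifetime, and `run_agent`, `toy_run_agent`, the `step_size` configs, and the made-up learning curve are all hypothetical stand-ins for a real agent and environment.

```python
import math
import random


def one_percent_tuning(run_agent, configs, lifetime_steps, tuning_percent=1, seeds=(0, 1, 2)):
    """Select a config from short tuning runs, then deploy it once for the full lifetime."""
    # Integer arithmetic keeps the tuning budget exact (1% of 100_000 is 1_000).
    tuning_steps = max(1, lifetime_steps * tuning_percent // 100)

    def tuning_score(config):
        # Average the short-run score over a few seeds.
        return sum(run_agent(config, tuning_steps, seed) for seed in seeds) / len(seeds)

    best_config = max(configs, key=tuning_score)
    # Deployment: one full-length run on a seed not used for tuning, with no further tuning.
    lifetime_score = run_agent(best_config, lifetime_steps, max(seeds) + 1)
    return best_config, tuning_steps, lifetime_score


def toy_run_agent(config, num_steps, seed):
    """Hypothetical stand-in for a real agent/environment loop (made-up learning curve)."""
    rng = random.Random(seed)
    step_size = config["step_size"]
    plateau = 1.0 - 10.0 * step_size                   # larger step sizes settle lower
    progress = 1.0 - math.exp(-step_size * num_steps)  # but improve faster early on
    return plateau * progress + rng.gauss(0.0, 0.01)


configs = [{"step_size": s} for s in (0.0001, 0.001, 0.01)]
best_config, tuning_steps, lifetime_score = one_percent_tuning(
    toy_run_agent, configs, lifetime_steps=100_000
)
print(best_config, tuning_steps, round(lifetime_score, 3))
```

The key design choice is that the selected configuration is fixed before deployment: the full-lifetime run is never used to revise the hyperparameters.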

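This paper studies a key problem in lifelong reinforcement learning. Using a new tuning and evaluation methodology in which only one percent of the experiment data may be used for hyperparameter tuning, it finds that DQN and Soft Actor Critic perform poorly, whereas several algorithmic mitigations that maintain network plasticity perform well, and that a network's ability to learn continually correlates closely with performance under one-percent tuning.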

Tuning for the Unknown: Revisiting Evaluation Strategies for Lifelong RL

In this paper, we define, evaluate, and improve the ``relay-generalization''
performance of reinforcement learning (RL) agents on out-of-distribution
``controllable'' states. Ideally, an RL agent that generally masters a task
should reach its goal starting from any controllable state of the environment
instead of memorizing a small set of trajectories. For example, a self-driving
system should be able to take over the control from humans in the middle of
driving and must continue to drive the car safely. To practically evaluate this
type of generalization, we start the test agent from the middle of other
independently well-trained \emph{stranger} agents' trajectories. With extensive
experimental evaluation, we show the prevalence of \emph{generalization
failure} on controllable states from stranger agents. For example, in the
Humanoid environment, we observed that a well-trained Proximal Policy
Optimization (PPO) agent, with only 3.9\% failure rate during regular testing,
failed on 81.6\% of the states generated by well-trained stranger PPO agents.
To improve ``relay generalization,'' we propose a novel method called
Self-Trajectory Augmentation (STA), which resets the environment to the
agent's old states according to the Q function during training. After applying
STA to the Soft Actor Critic (SAC) training procedure, we reduced the failure
rate of SAC under relay-evaluation by more than a factor of three in most
settings without degrading agent performance or increasing the number of
environment interactions required. Our code is available at
this https URL
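
The relay-evaluation idea (start the test agent from the middle of a stranger agent's trajectory and check whether it still reaches the goal) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the two-lane corridor, the dictionary policy, and the names `relay_failure_rate` and `rollout_from` are hypothetical and do not come from the paper, and the sketch does not implement STA.

```python
GOAL_X = 6  # the episode succeeds once the agent's x position reaches this value


def step(state, action):
    """Toy corridor dynamics: 'forward' advances x by one, anything else stays put."""
    x, lane = state
    return (x + 1, lane) if action == "forward" else state


def rollout_from(policy, state, max_steps=20):
    """Run the policy from an arbitrary start state; return True if the goal is reached."""
    for _ in range(max_steps):
        if state[0] >= GOAL_X:
            return True
        # A memorizing policy has no entry for unseen states and falls back to 'stay'.
        state = step(state, policy.get(state, "stay"))
    return state[0] >= GOAL_X


def relay_failure_rate(policy, trajectories):
    """Fraction of trajectories whose midpoint state the policy fails to finish from."""
    failures = 0
    for trajectory in trajectories:
        relay_state = trajectory[len(trajectory) // 2]
        if not rollout_from(policy, relay_state):
            failures += 1
    return failures / len(trajectories)


# The test agent only ever visited lane 0 and memorized what to do there.
test_policy = {(x, 0): "forward" for x in range(GOAL_X)}
own_trajectories = [[(x, 0) for x in range(GOAL_X + 1)]]
# A stranger agent solves the same task through lane 1.
stranger_trajectories = [[(x, 1) for x in range(GOAL_X + 1)]]

regular_failure = relay_failure_rate(test_policy, own_trajectories)
relay_failure = relay_failure_rate(test_policy, stranger_trajectories)
print(regular_failure, relay_failure)
```

In this toy setting the agent never fails from its own states but always fails from the stranger's, which is an exaggerated version of the gap the abstract reports.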

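This paper studies the relay-generalization performance of reinforcement learning (RL) agents on controllable states and proposes a new method, Self-Trajectory Augmentation (STA), to improve agents' generalization on such states; experiments show that the method is effective.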

Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories

Recently, applications of deep neural networks (DNNs) have become
prominent in many fields, such as computer vision (CV) and natural language
processing (NLP), owing to their superior feature extraction performance. However,
their high-dimensional model parameters and large-scale computation
restrict execution efficiency, especially on Internet of Things (IoT)
devices. Unlike the previous cloud/edge-only pattern, which puts heavy
pressure on uplink communication, and the device-only pattern, which imposes an
unaffordable computational load, we highlight collaborative computation
between the device and the edge for DNN models, which can achieve a good balance
between communication load and execution accuracy. Specifically, a
systematic on-demand co-inference framework is proposed to exploit the
multi-branch structure, in which the pre-trained AlexNet is right-sized through
\emph{early-exit} and partitioned at an intermediate DNN layer. Integer
quantization is enforced to further compress the transmitted bits. Accordingly, we
establish a new Deep Reinforcement Learning (DRL) optimizer, Soft Actor Critic
for discrete (SAC-d), which generates the \emph{exit point}, \emph{partition
point}, and \emph{compressing bits} by soft policy iterations. Based on a
latency- and accuracy-aware reward design, this optimizer adapts well to
complex environments, such as dynamic wireless channels and arbitrary CPU
processing, and is capable of supporting 5G URLLC. Real-world experiments on a
Raspberry Pi 4 and a PC show the superior performance of the proposed solution.
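
As a rough illustration of the integer-quantization step, the sketch below compresses intermediate features to a chosen bit width before they would be sent from the device to the edge. The abstract does not specify the quantization scheme, so the per-tensor min/max uniform quantizer, the function names `quantize` and `dequantize`, and the sample feature values are all assumptions.

```python
def quantize(features, num_bits):
    """Map floats to integer codes in [0, 2**num_bits - 1] using min/max scaling."""
    lo, hi = min(features), max(features)
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid dividing by zero on constant input
    codes = [round((value - lo) / scale) for value in features]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats on the receiving side."""
    return [lo + code * scale for code in codes]


features = [0.0, 0.13, 0.42, 0.77, 1.0]  # hypothetical activations at the partition point
num_bits = 4                              # the "compressing bits" choice
codes, lo, scale = quantize(features, num_bits)
restored = dequantize(codes, lo, scale)

max_error = max(abs(a - b) for a, b in zip(features, restored))
payload_bits = len(codes) * num_bits      # excludes the small (lo, scale) header
print(codes, payload_bits, max_error)
```

Fewer bits shrink the payload but raise the reconstruction error (at most half a quantization step here), which is the kind of latency/accuracy trade-off the optimizer is described as balancing.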

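Proposes a device-edge collaborative computing framework for IoT devices running deep neural network (DNN) models. Through a multi-branch structure, early exit, partitioning at an intermediate layer, and integer quantization, it achieves a good balance between communication load and execution accuracy. Combined with a deep reinforcement learning optimizer based on Soft Actor Critic (SAC-d), it adapts to dynamic wireless channels and arbitrary CPU processing, and experiments were conducted on a Raspberry Pi 4 and a PC.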

An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic

Deep Reinforcement Learning (DRL) algorithms for continuous action spaces are
known to be brittle with respect to hyperparameters as well as sample
inefficient. Soft Actor Critic (SAC) proposes an off-policy deep actor critic
algorithm within the maximum entropy RL framework which offers greater
stability and empirical gains. The choice of policy distribution, a factored
Gaussian, is motivated by its easy re-parametrization rather
than its modeling power. We introduce Normalizing Flow policies within the SAC
framework that learn more expressive classes of policies than simple factored
Gaussians. We show empirically on
continuous grid world tasks that our approach increases stability and is better
suited to difficult exploration in sparse reward settings.
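
To make the contrast between a factored Gaussian and a flow-based policy concrete, the sketch below draws a reparameterized Gaussian sample and then pushes it through a single planar flow layer, updating the log-density with the change-of-variables formula. The planar flow is used only as one simple example of a normalizing flow; the abstract does not say which flow family the paper uses, and the function names and parameter values here are hypothetical.

```python
import math
import random


def sample_factored_gaussian(mean, log_std, rng):
    """Reparameterized sample z = mean + std * eps, returned with its log-density."""
    z, log_prob = [], 0.0
    for m, ls in zip(mean, log_std):
        eps = rng.gauss(0.0, 1.0)
        z.append(m + math.exp(ls) * eps)
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
    return z, log_prob


def planar_flow(z, log_prob, u, w, b):
    """Apply a = z + u * tanh(w.z + b) and subtract log|det da/dz| from the log-density."""
    h = math.tanh(sum(wi * zi for wi, zi in zip(w, z)) + b)
    a = [zi + ui * h for zi, ui in zip(z, u)]
    det = 1.0 + (1.0 - h * h) * sum(ui * wi for ui, wi in zip(u, w))
    return a, log_prob - math.log(abs(det))


rng = random.Random(0)
z, lp = sample_factored_gaussian(mean=[0.0, 0.0], log_std=[-0.5, -0.5], rng=rng)

# With u = 0 the flow is the identity, so the policy reduces to the factored Gaussian.
a_identity, lp_identity = planar_flow(z, lp, u=[0.0, 0.0], w=[1.0, 0.5], b=0.0)

# With u.w = 0.75 > -1 the layer is invertible and reshapes the action density.
a_flow, lp_flow = planar_flow(z, lp, u=[0.5, 0.5], w=[1.0, 0.5], b=0.0)
print(z, a_flow, lp, lp_flow)
```

Because the base sample stays reparameterized and the flow is differentiable, gradients can still pass through the sampled action, which is the property the abstract cites as the motivation for the Gaussian choice.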

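This work proposes a normalizing-flow policy distribution within the Soft Actor Critic algorithm, increasing the policy's expressiveness to improve stability and exploration in sparse-reward settings.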