Although Reinforcement Learning (RL) has been shown to be capable of producing
impressive results, its use is limited by the impact of its hyperparameters on
performance. This often makes it difficult to achieve good results in practice.
Automated RL (AutoRL) addresses this difficulty, yet little is known about the
dynamics of the hyperparameter landscapes that hyperparameter optimization
(HPO) methods traverse in search of optimal configurations. In view of existing
AutoRL approaches dynamically adjusting hyperparameter configurations, we
propose an approach to build and analyze these hyperparameter landscapes not
just for one point in time but at multiple points in time throughout training.
Addressing an important open question on the legitimacy of such dynamic AutoRL
approaches, we provide thorough empirical evidence that the hyperparameter
landscapes strongly vary over time across representative algorithms from RL
literature (DQN and SAC) in different kinds of environments (Cartpole and
Hopper). This supports the theory that hyperparameters should be dynamically
adjusted during training and shows the potential for more insights on AutoRL
problems that can be gained through landscape analyses.

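This study proposes a method for analyzing hyperparameter landscapes at multiple points during RL training. Its experiments show that these landscapes change over time, supporting the view that hyperparameters should be adjusted dynamically during training.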
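
The idea of building a landscape at several points in training, rather than once, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy `train_phase` (gradient descent on f(x) = x**2) stands in for an RL training phase, `score` stands in for the evaluation return, and the choice to continue each phase from the best configuration's checkpoint is one plausible protocol, not necessarily the authors'.

```python
def train_phase(x, lr, steps=5):
    """Toy stand-in for one RL training phase: gradient descent on f(x) = x**2."""
    for _ in range(steps):
        x = x - lr * 2.0 * x  # gradient of x**2 is 2x
    return x


def score(x):
    """Toy stand-in for an evaluation return (higher is better)."""
    return -x * x


def landscapes_over_time(x0, learning_rates, num_phases=3):
    """Record one landscape (learning rate -> score) per training phase.

    Every configuration in a phase starts from the same checkpoint, so the
    scores within a phase are comparable. The next phase continues from the
    checkpoint of the best configuration (an assumed protocol).
    """
    checkpoint = x0
    landscapes = []
    for _ in range(num_phases):
        results = {lr: train_phase(checkpoint, lr) for lr in learning_rates}
        landscape = {lr: score(x) for lr, x in results.items()}
        landscapes.append(landscape)
        best_lr = max(landscape, key=landscape.get)
        checkpoint = results[best_lr]
    return landscapes


if __name__ == "__main__":
    for phase, landscape in enumerate(landscapes_over_time(1.0, [0.01, 0.1, 0.4])):
        print(phase, landscape)
```

Comparing the per-phase dictionaries is the toy analogue of comparing landscapes across time; in a real study each dictionary entry would come from a full RL training and evaluation run.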

AutoRL Hyperparameter Landscapes

Despite a series of recent successes in reinforcement learning (RL), many RL
algorithms remain sensitive to hyperparameters. As such, there has recently
been interest in the field of AutoRL, which seeks to automate design decisions
to create more general algorithms. Recent work suggests that population-based
approaches may be effective AutoRL algorithms, learning hyperparameter
schedules on the fly. In particular, the PB2 algorithm achieves
strong performance in RL tasks by formulating online hyperparameter
optimization as a time-varying GP-bandit problem, while also providing
theoretical guarantees. However, PB2 is only designed to work for continuous
hyperparameters, which severely limits its utility in practice. In this paper
we introduce a new (provably) efficient hierarchical approach for optimizing
both continuous and categorical variables, using a new time-varying bandit
algorithm specifically designed for the population-based training regime. We
evaluate our approach on the challenging Procgen benchmark, where we show that
explicitly modelling dependence between data augmentation and other
hyperparameters improves generalization.

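This paper introduces a new AutoRL algorithm that uses a purpose-built time-varying bandit algorithm to jointly optimize continuous and categorical variables, improving generalization on the Procgen benchmark.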

Tuning Mixed Input Hyperparameters on the Fly for Efficient Population Based AutoRL

Many continuous control tasks have easily formulated objectives, yet using
them directly as a reward in reinforcement learning (RL) leads to suboptimal
policies. Therefore, many classical control tasks guide RL training using
complex rewards, which require tedious hand-tuning. We automate the reward
search with AutoRL, an evolutionary layer over standard RL that treats reward
tuning as hyperparameter optimization and trains a population of RL agents to
find a reward that maximizes the task objective. AutoRL, evaluated on four
Mujoco continuous control tasks over two RL algorithms, shows improvements over
baselines, with the biggest uplift for more complex tasks. The video can be
found at: https://youtu.be/svdaOFfQyC8.

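AutoRL is an evolutionary layer that treats reward tuning as hyperparameter optimization and trains a population of RL agents to find a reward that maximizes the task objective. Evaluated on four Mujoco continuous control tasks over two RL algorithms, it improves over baselines, with the largest gains on more complex tasks.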
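
The evolutionary reward search described above can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: `train_agent` is a hypothetical stand-in for a full RL training run (here the agent is assumed to optimize its shaped reward perfectly), `task_objective` is a made-up scalar objective, and the selection scheme (keep the top half, add Gaussian-mutated copies) is one simple evolutionary strategy chosen for illustration.

```python
import random


def train_agent(reward_weight):
    """Hypothetical stand-in for an RL training run.

    Toy assumption: the agent perfectly optimizes the shaped reward
    -(action - reward_weight)**2, so it converges to action == reward_weight.
    """
    return reward_weight


def task_objective(action, target=3.0):
    """Made-up task objective that the reward search tries to maximize."""
    return -(action - target) ** 2


def evolve_reward(pop_size=8, generations=20, sigma=0.5, seed=0):
    """Search over reward parameters with a simple elitist evolutionary loop."""
    rng = random.Random(seed)
    population = [rng.uniform(-5.0, 5.0) for _ in range(pop_size)]
    history = []  # best task objective seen in each generation

    def fitness(weight):
        return task_objective(train_agent(weight))

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        survivors = ranked[: pop_size // 2]  # keep the top half unchanged
        children = [w + rng.gauss(0.0, sigma) for w in survivors]  # mutate copies
        population = survivors + children

    return max(population, key=fitness), history


if __name__ == "__main__":
    best_weight, history = evolve_reward()
    print(best_weight, history[0], history[-1])
```

Because the survivors are carried over unchanged, the best task objective in the population cannot decrease from one generation to the next. In a real setting each call to `train_agent` would be an expensive RL run, which is why the population is trained in parallel.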