Reinforcement learning has traditionally focused on learning state-dependent
policies to solve optimal control problems in a closed-loop fashion. In this
work, we introduce the paradigm of open-loop reinforcement learning where a
fixed action sequence is learned instead. We present three new algorithms: one
robust model-based method and two sample-efficient model-free methods. Rather
than basing our algorithms on Bellman's equation from dynamic programming, our
work builds on Pontryagin's principle from the theory of open-loop optimal
control. We provide convergence guarantees and evaluate all methods empirically
on a pendulum swing-up task, as well as on two high-dimensional MuJoCo tasks,
demonstrating remarkable performance compared to existing baselines.
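
To make the open-loop idea concrete, the following is a minimal sketch of how a fixed action sequence can be improved using the costate (adjoint) recursion that Pontryagin's principle suggests. It is not one of the paper's three algorithms: the linear dynamics `A`, `B`, the quadratic cost weights `Q`, `R`, the initial state `X0`, and the function names are all hypothetical choices for illustration, and the dynamics are assumed to be known exactly.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): a double integrator
# with quadratic state and action costs.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
X0 = np.array([1.0, 0.0])


def rollout(actions):
    """Simulate x_{t+1} = A x_t + B u_t and return the states x_0..x_T."""
    states = [X0]
    for u in actions:
        states.append(A @ states[-1] + B @ u)
    return np.array(states)


def total_cost(actions):
    """Quadratic cost summed over all states (including x_T) and all actions."""
    states = rollout(actions)
    state_cost = 0.5 * np.einsum("ti,ij,tj->", states, Q, states)
    action_cost = 0.5 * np.einsum("ti,ij,tj->", actions, R, actions)
    return state_cost + action_cost


def cost_gradient(actions):
    """Gradient of the cost w.r.t. every action via a backward costate pass."""
    states = rollout(actions)
    costate = Q @ states[-1]  # lambda_T: gradient of the terminal cost
    grad = np.zeros_like(actions)
    for t in reversed(range(len(actions))):
        grad[t] = R @ actions[t] + B.T @ costate  # dJ/du_t
        costate = Q @ states[t] + A.T @ costate   # lambda_t
    return grad


def optimize(horizon=20, steps=200, lr=0.05):
    """Plain gradient descent on the open-loop action sequence."""
    actions = np.zeros((horizon, 1))
    for _ in range(steps):
        actions -= lr * cost_gradient(actions)
    return actions
```

Calling `optimize()` returns an action sequence that depends only on time, not on the observed state, which is the defining property of an open-loop solution. The forward rollout followed by a backward costate pass is the part that replaces a Bellman-style value function in this sketch.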

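Reinforcement learning has traditionally focused on learning state-dependent policies to solve closed-loop optimal control problems. This paper proposes the open-loop reinforcement learning paradigm, which learns a fixed action sequence instead, and introduces three new algorithms: one robust model-based method and two sample-efficient model-free methods. Building on Pontryagin's principle from open-loop optimal control theory rather than Bellman's equation from dynamic programming, the authors provide convergence guarantees and evaluate the methods empirically on a pendulum swing-up task and two high-dimensional MuJoCo tasks, where they perform strongly compared to existing baselines.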

A Pontryagin Perspective on Reinforcement Learning

Most reinforcement learning methods rely heavily on dense, well-normalized
environment rewards. DreamerV3 recently introduced a model-based method with a
number of tricks that mitigate these limitations, achieving state-of-the-art on
a wide range of benchmarks with a single set of hyperparameters. This result
sparked discussion about the generality of the tricks, since they appear to be
applicable to other reinforcement learning algorithms. Our work applies
DreamerV3's tricks to PPO and is the first such empirical study outside of the
original work. Surprisingly, we find that the tricks presented do not transfer
as general improvements to PPO. We use a high-quality PPO reference
implementation and present extensive ablation studies totaling over 10,000 A100
hours on the Arcade Learning Environment and the DeepMind Control Suite. Though
our experiments demonstrate that these tricks do not generally outperform PPO,
we identify cases where they succeed and offer insight into the relationship
between the implementation tricks. In particular, PPO with these tricks
performs comparably to PPO on Atari games with reward clipping and
significantly outperforms PPO without reward clipping.
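
The abstract does not list the individual tricks. One of the techniques described in the DreamerV3 paper is the symlog transform, which compresses large-magnitude targets so that a learner is less sensitive to reward scale. The sketch below shows only the transform and its inverse; the function names are illustrative, and how the transform would be wired into PPO (for example, on value targets) is an assumption, not something stated here.

```python
import numpy as np


def symlog(x):
    """Symmetric log: sign(x) * ln(|x| + 1). Compresses large magnitudes."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.expm1(np.abs(x))


# Returns that differ by orders of magnitude end up on a similar scale.
raw_returns = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
compressed = symlog(raw_returns)
recovered = symexp(compressed)
```

Because the transform is roughly linear near zero and logarithmic for large values, it behaves like a soft alternative to reward clipping: small rewards are left almost unchanged while large ones are compressed rather than truncated.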

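An empirical study of the tricks from the model-based method DreamerV3 that identifies cases where they do not transfer to the reinforcement learning algorithm PPO, and analyzes how the tricks are implemented and how they affect performance.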

Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks

Deep reinforcement learning (DRL) has achieved significant success in various
robot tasks: manipulation, navigation, etc. However, complex visual
observations in natural environments remain a major challenge. This paper
presents Contrastive Variational Reinforcement Learning (CVRL), a model-based
method that tackles complex visual observations in DRL. CVRL learns a
contrastive variational model by maximizing the mutual information between
latent states and observations discriminatively, through contrastive learning.
It avoids modeling the complex observation space unnecessarily, as the commonly
used generative observation model often does, and is significantly more robust.
CVRL achieves performance comparable to state-of-the-art model-based DRL
methods on standard MuJoCo tasks. It significantly outperforms them on Natural
MuJoCo tasks and a robot box-pushing task with complex observations, e.g.,
dynamic shadows. The CVRL code is available publicly at
this https URL
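
The abstract does not give CVRL's exact objective. As a rough illustration of what maximizing mutual information "discriminatively, through contrastive learning" can look like, the sketch below computes an InfoNCE-style loss, a commonly used contrastive bound in which the mutual information is at least log(N) minus the loss for a batch of N pairs. The function name, the cosine-similarity critic, and the `temperature` parameter are assumptions made for illustration and are not taken from the CVRL code.

```python
import numpy as np


def info_nce_loss(latents, obs_embeddings, temperature=0.1):
    """InfoNCE loss for a batch of (latent state, observation embedding) pairs.

    Row i of `latents` is the positive match for row i of `obs_embeddings`;
    every other row in the batch serves as a negative example.
    """
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    o = obs_embeddings / np.linalg.norm(obs_embeddings, axis=1, keepdims=True)
    logits = z @ o.T / temperature                       # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # cross-entropy on matches


# Hypothetical data: observation embeddings that closely track the latents.
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))
matched_obs = latents + 0.01 * rng.normal(size=(8, 16))
mismatched_obs = np.roll(matched_obs, shift=1, axis=0)

matched_loss = info_nce_loss(latents, matched_obs)
mismatched_loss = info_nce_loss(latents, mismatched_obs)
```

The loss only requires telling the matching observation apart from the other observations in the batch, so no pixel-level reconstruction is needed. This is the sense in which a contrastive model can avoid modeling the full observation space, in contrast to a generative observation model.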

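CVRL applies a contrastive variational approach to reinforcement learning to handle complex visual observations. It performs comparably to existing methods on standard MuJoCo tasks and significantly outperforms them on Natural MuJoCo tasks and a robot box-pushing task.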