Motivated by the success of Transformers when applied to sequences of
discrete symbols, token-based world models (TBWMs) were recently proposed as
sample-efficient methods. In TBWMs, the world model consumes agent experience
as a language-like sequence of tokens, where each observation constitutes a
sub-sequence. However, during imagination, the sequential token-by-token
generation of next observations results in a severe bottleneck, leading to long
training times, poor GPU utilization, and limited representations. To resolve
this bottleneck, we devise a novel Parallel Observation Prediction (POP)
mechanism. POP augments a Retentive Network (RetNet) with a novel forward mode
tailored to our reinforcement learning setting. We incorporate POP in a new
TBWM agent named REM (Retentive Environment Model), which achieves 15.4x faster
imagination than prior TBWMs. REM attains superhuman performance on 12
out of 26 games of the Atari 100K benchmark, while training in less than 12
hours. Our code is available at https://github.com/leor-c/REM.

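Building on Transformers applied to language-like symbol sequences, token-based world models (TBWMs) have been proposed. A Parallel Observation Prediction (POP) mechanism is introduced to resolve the bottleneck in generating observations. Applied in the TBWM agent REM (Retentive Environment Model), POP yields superhuman performance on 12 games of the Atari 100K benchmark with less than 12 hours of training.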

Improving Token-Based World Models with Parallel Observation Prediction
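
For context on the RetNet backbone named in the abstract, the following is a minimal sketch of a single retention head computed two ways: in parallel over a whole sequence, and recurrently one token at a time. It is not the POP forward mode and is not taken from the REM code base. The function names, shapes and decay value are illustrative assumptions, and the sketch leaves out multi-head structure, positional rotation and normalization. It only illustrates why a retention layer can trade token-by-token generation for a batched computation.

```python
import numpy as np


def retention_parallel(q, k, v, gamma):
    """Parallel form: all positions at once, with a causal decay mask."""
    n = q.shape[0]
    pos = np.arange(n)
    # decay[i, j] = gamma ** (i - j) for j <= i, and 0 for future positions.
    diff = pos[:, None] - pos[None, :]
    decay = np.tril(gamma ** np.clip(diff, 0, None))
    return ((q @ k.T) * decay) @ v


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: one token per step, carrying a fixed-size state."""
    state = np.zeros((k.shape[1], v.shape[1]))
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        # Decay the running summary, then add the current key-value outer product.
        state = gamma * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out


# Illustrative inputs only: 6 tokens with 4-dimensional queries, keys and values.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
same = np.allclose(retention_parallel(q, k, v, 0.9),
                   retention_recurrent(q, k, v, 0.9))
print(same)
```

Both functions compute the same weighted sum of past values, so the recurrent form can be used when stepping through an environment and the parallel form when a whole block of tokens is available at once.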

We introduce a value-based RL agent, which we call BBF, that achieves
super-human performance in the Atari 100K benchmark. BBF relies on scaling the
neural networks used for value estimation, as well as a number of other design
choices that enable this scaling in a sample-efficient manner. We conduct
extensive analyses of these design choices and provide insights for future
work. We end with a discussion about updating the goalposts for
sample-efficient RL research on the ALE. We make our code and data publicly
available.

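We introduce a value-based reinforcement learning agent called BBF that achieves superhuman performance on the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, along with several other design choices that make this scaling sample-efficient. We analyze these design choices in detail and offer insights for future work. We close with a discussion of updating the goalposts for sample-efficient RL research on the ALE. Our code and data are publicly available.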

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Deep reinforcement learning (RL) algorithms are predominantly evaluated by
comparing their relative performance on a large suite of tasks. Most published
results on deep RL benchmarks compare point estimates of aggregate performance
such as mean and median scores across tasks, ignoring the statistical
uncertainty implied by the use of a finite number of training runs. Beginning
with the Arcade Learning Environment (ALE), the shift towards
computationally-demanding benchmarks has led to the practice of evaluating only
a small number of runs per task, exacerbating the statistical uncertainty in
point estimates. In this paper, we argue that reliable evaluation in the
few-run deep RL regime cannot ignore the uncertainty in results without running the
risk of slowing down progress in the field. We illustrate this point using a
case study on the Atari 100k benchmark, where we find substantial discrepancies
between conclusions drawn from point estimates alone versus a more thorough
statistical analysis. With the aim of increasing the field's confidence in
reported results with a handful of runs, we advocate for reporting interval
estimates of aggregate performance and propose performance profiles to account
for the variability in results, as well as present more robust and efficient
aggregate metrics, such as interquartile mean scores, to achieve small
uncertainty in results. Using such statistical tools, we scrutinize performance
evaluations of existing algorithms on other widely used RL benchmarks including
the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies
in prior comparisons. Our findings call for a change in how we evaluate
performance in deep RL, for which we present a more rigorous evaluation
methodology, accompanied by an open-source library, rliable, to prevent
unreliable results from stagnating the field.

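Through a case study on the Atari 100k benchmark, this paper stresses that deep RL evaluations based on only a few training runs cannot ignore statistical uncertainty if results are to be reliable and progress in the field is not to stall. It proposes interval estimates for evaluating RL algorithm performance, re-analyzes existing algorithms on widely used benchmarks, and presents a more rigorous evaluation methodology accompanied by the open-source library rliable.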
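
The interquartile mean and interval estimates mentioned above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the rliable API: the function names, the score layout (one row per run, one column per task), the percentile bootstrap and the choice to resample runs separately for each task are all assumptions made for this example, and the scores are made up.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of all scores, pooled across runs and tasks."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values dropped from each end
    return flat[cut:flat.size - cut].mean()


def bootstrap_interval(scores, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; runs are resampled separately per task."""
    scores = np.asarray(scores, dtype=float)  # assumed shape: (n_runs, n_tasks)
    rng = np.random.default_rng(seed)
    n_runs, n_tasks = scores.shape
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        # For every task (column), draw n_runs run indices with replacement.
        idx = rng.integers(0, n_runs, size=(n_runs, n_tasks))
        stats[i] = statistic(np.take_along_axis(scores, idx, axis=0))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return low, high


# Made-up normalized scores: 5 runs on 26 tasks (illustrative data only).
demo = np.random.default_rng(1).lognormal(mean=-0.5, sigma=1.0, size=(5, 26))
print("IQM:", interquartile_mean(demo))
print("95% interval:", bootstrap_interval(demo, interquartile_mean))
```

Because the interquartile mean discards the top and bottom quarter of scores, a single outlier run moves it far less than it moves the mean, while it still uses more of the data than the median does.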