Owing to their ability to both effectively integrate information over long
time horizons and scale to massive amounts of data, self-attention
architectures have recently shown breakthrough success in natural language
processing (NLP), achieving state-of-the-art results in domains such as
language modeling and machine translation. Harnessing the transformer's ability
to process long time horizons of information could provide a similar
performance boost in partially observable reinforcement learning (RL) domains,
but the large-scale transformers used in NLP have yet to be successfully
applied to the RL setting. In this work we demonstrate that the standard
transformer architecture is difficult to optimize, which was previously
observed in the supervised learning setting but becomes especially pronounced
with RL objectives. We propose architectural modifications that substantially
improve the stability and learning speed of the original Transformer and its XL
variant. The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses
LSTMs on challenging memory environments and achieves state-of-the-art results
on the multi-task DMLab-30 benchmark suite, exceeding the performance of an
external memory architecture. We show that the GTrXL, trained using the same
losses, has stability and performance that consistently match or exceed those of a
competitive LSTM baseline, including on more reactive tasks where memory is
less critical. GTrXL offers an easy-to-train, simple-to-implement but
substantially more expressive architectural alternative to the standard
multi-layer LSTM ubiquitously used for RL agents in partially observable
environments.

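Following the breakthrough success of transformers in natural language processing, this paper proposes a modified transformer architecture, the Gated Transformer-XL (GTrXL). In partially observable reinforcement learning (RL) domains, it matches or exceeds the stability and performance of a competitive LSTM baseline, surpasses LSTMs, and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite.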

Stabilizing Transformers for Reinforcement Learning

Searching the space of policies directly for the optimal policy has been one
popular method for solving partially observable reinforcement learning
problems. Typically, with each change of the target policy, its value is
estimated from the results of following that very policy. This requires a large
number of interactions with the environment as different policies are
considered. We present a family of algorithms based on likelihood ratio
estimation that use data gathered when executing one policy (or collection of
policies) to estimate the value of a different policy. The algorithms combine
estimation and optimization stages. The former utilizes experience to build a
non-parametric representation of an optimized function. The latter performs
optimization on this estimate. We show positive empirical results and provide
a sample complexity bound.

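A family of algorithms based on likelihood ratio estimation uses experience data in separate estimation and optimization stages to optimize policies, solving partially observable reinforcement learning problems more efficiently; the algorithms perform well in experiments.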
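
The core idea in the abstract above, reusing data gathered under one policy to estimate the value of a different policy, can be illustrated with a plain importance-sampling estimator. The sketch below is not the authors' algorithm: the one-step toy environment, the dictionary-based policy tables, and the function names (`likelihood_ratio`, `estimate_value`, `sample_episode`) are all assumptions made for illustration, and it covers only the estimation stage, not the optimization stage.

```python
import random


def likelihood_ratio(trajectory, target_policy, behavior_policy):
    """Product over steps of target(a | o) / behavior(a | o)."""
    ratio = 1.0
    for obs, action in trajectory:
        ratio *= target_policy[obs][action] / behavior_policy[obs][action]
    return ratio


def estimate_value(episodes, target_policy, behavior_policy):
    """Estimate the target policy's expected return from episodes that
    were collected while following the behavior policy."""
    total = 0.0
    for trajectory, episode_return in episodes:
        weight = likelihood_ratio(trajectory, target_policy, behavior_policy)
        total += weight * episode_return
    return total / len(episodes)


def sample_episode(policy, rng):
    # Hypothetical toy environment: one observation "o", one step.
    # Action 1 yields return 1.0 and action 0 yields return 0.0.
    action = 1 if rng.random() < policy["o"][1] else 0
    return [("o", action)], float(action)


rng = random.Random(0)
behavior = {"o": {0: 0.5, 1: 0.5}}  # policy that gathered the data
target = {"o": {0: 0.2, 1: 0.8}}    # policy we want to evaluate

episodes = [sample_episode(behavior, rng) for _ in range(20000)]
estimate = estimate_value(episodes, target, behavior)

# In this toy problem the true value of the target policy is 0.8,
# so the estimate should land near that number.
print(estimate)
```

Because the same batch of episodes can be re-weighted for any candidate target policy, an optimizer can search over policies against this estimate without collecting fresh data for each candidate. The estimator's variance grows as the target and behavior policies diverge, which is one reason practical methods refine this basic form.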