Recently, sequence learning methods have been applied to the problem of
off-policy Reinforcement Learning, including the seminal work on Decision
Transformers, which employs transformers for this task. Since transformers are
parameter-heavy, cannot benefit from history longer than a fixed window size,
and are not computed using recurrence, we set out to investigate the
suitability of the S4 family of models, which are based on state-space layers
and have been shown to outperform transformers, especially in modeling
long-range dependencies. In this work we present two main algorithms: (i) an
off-policy training procedure that works with trajectories while still
maintaining the training efficiency of the S4 model, and (ii) an on-policy
training procedure that is trained in a recurrent manner, benefits from
long-range dependencies, and is based on a novel stable actor-critic mechanism. Our
results indicate that our method outperforms multiple variants of decision
transformers, as well as the other baseline methods on most tasks, while
reducing the latency, number of parameters, and training time by several orders
of magnitude, making our approach more suitable for real-world RL.

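This work proposes two algorithms: one performs off-policy training on trajectories, and the other performs on-policy training through a recurrent procedure built on a stable actor-critic mechanism. Experimental results show that the method outperforms multiple variants of Decision Transformers and other baseline methods, while reducing latency, parameter count, and training time, making it better suited to real-world RL.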