Transformers achieve state-of-the-art performance on natural language processing tasks by pre-training on large-scale text corpora. However, they are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism that remembers and reuses past examples by saving them to, and replaying them from, a memory buffer. It has been used successfully in reinforcement learning and GANs because it improves sample efficiency. In this paper, we propose \emph{Transformer with Memory Replay} (TMR), which integrates memory replay with the transformer to make it more sample-efficient. Experiments on the GLUE and SQuAD benchmark datasets show that TMR improves on the baseline transformer model by at least $1$ percentage point when pre-trained on the same number of examples. Further, by adopting a careful design that reduces the wall-clock time overhead of memory replay, we also achieve better runtime efficiency empirically.
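
To make the save-and-replay idea concrete, the following is a minimal sketch of a generic replay buffer mixed into a stream of training batches. It is not the TMR design: the abstract does not specify the buffer policy, so the FIFO eviction, the uniform sampling, and the names `ReplayBuffer`, `batches_with_replay`, `capacity`, and `replay_size` are all illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer of past training examples (hypothetical sketch)."""

    def __init__(self, capacity, seed=0):
        # Assumption: FIFO eviction, the oldest examples are dropped first.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def save(self, examples):
        """Store examples that have just been seen."""
        self.buffer.extend(examples)

    def replay(self, k):
        """Return up to k stored examples, sampled uniformly without replacement."""
        k = min(k, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def batches_with_replay(stream, buffer, replay_size):
    """Yield each fresh batch extended with replayed past examples."""
    for fresh in stream:
        replayed = buffer.replay(replay_size)  # drawn before saving the fresh batch
        buffer.save(fresh)
        yield list(fresh) + replayed


if __name__ == "__main__":
    # Toy stream: integers stand in for tokenized pre-training examples.
    stream = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
    buf = ReplayBuffer(capacity=6)
    for batch in batches_with_replay(stream, buf, replay_size=2):
        print(batch)
```

In this sketch the first batch carries no replayed examples because the buffer starts empty, and every later batch reuses two examples seen earlier, so the model would see more training signal per fresh example drawn from the corpus.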

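This paper proposes a method that combines a memory replay mechanism with the Transformer, called Transformer with Memory Replay (TMR), which is pre-trained on large-scale text corpora and makes the Transformer more sample-efficient. Experiments on the GLUE and SQuAD benchmark datasets show that, compared with the baseline transformer model, the Transformer with memory replay improves by at least 1 percentage point when pre-trained on the same number of examples. In addition, a careful design that reduces the wall-clock time overhead of memory replay also yields better runtime efficiency.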

Transformer with Memory Replay