Transformer Quality in Linear Time

February 2022

Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le

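TL;DR: This paper proposes improving Transformer models with a gated attention unit and a linear approximation method; the resulting model is named FLASH. On both short and long sequences, FLASH matches the perplexity of improved Transformers, while speeding up training by up to 4.9× on Wiki-40B and 12.1× on PG-19 for auto-regressive language modeling, and by 4.8× on C4 for masked language modeling.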

Abstract

We revisit the design choices in transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a we