Large Transformer models yield impressive results on many tasks, but are
expensive to train, or even fine-tune, and so slow at decoding that their use
and study become out of reach. We address this problem by leveraging sparsity.
We study sparse variants for all layers in the Transformer and propose Scaling
Transformers, a family of next-generation Transformer models that use sparse
layers to scale efficiently and perform unbatched decoding much faster than the
standard Transformer as we scale up the model size. Surprisingly, the sparse
layers are enough to obtain the same perplexity as the standard Transformer
with the same number of parameters. We also integrate with prior sparsity
approaches to attention and enable fast inference on long sequences even with
limited memory. This results in performance competitive with the state of the
art on long text summarization.

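This work proposes an approach to building next-generation Transformer models that uses sparse layers to scale efficiently and to perform unbatched decoding efficiently. The results show that, with the same number of parameters, such a model matches the results of the standard Transformer and performs well on long text summarization.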