Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have either used a fixed pattern or data-driven approach to reduce the number of operations involving the computation of multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from inevitable problems, such as the potential loss of essential sequence features due to the uniform fixed pattern applied across all layers, and an increase in the model size resulting from the use of additional parameters to learn sparsity patterns in attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training. Efficient implementations of the layer-wise sparsified attention algorithm on GPUs are developed, demonstrating a new SPION that achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer models, with better evaluation quality.

我们提出了一种新颖的Transformer稀疏化方案，通过整合卷积滤波器和泛洪填充方法，高效捕捉自注意操作中的逐层稀疏模式，从而在Transformer的训练过程中降低计算复杂度和内存占用，实现了超过现有稀疏Transformer模型的3.08倍加速，并具有更好的评估质量。

SPION: 基于卷积漫延的Transformer分层稀疏训练