BriefGPT.xyz
Aug, 2024
Mixed Sparsity Training: Achieving 4× FLOP Reduction for Transformer Pretraining
Pihe Hu, Shaolong Li, Longbo Huang
TL;DR
This work targets the heavy computational demands of pretraining large language models and proposes a novel solution: Mixed Sparsity Training (MST). By combining dynamic sparse training with sparsity variation and a hybrid sparse attention mechanism, MST reduces floating-point operations (FLOPs) by up to 75% while maintaining model performance, yielding a substantial gain in computational efficiency.
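
To make the TL;DR concrete, the snippet below is a minimal PyTorch sketch, not the authors' implementation: it assumes a magnitude-based weight mask that is refreshed under a linearly relaxing sparsity schedule, which is one simple way to picture dynamic sparse training combined with sparsity variation (real DST methods also regrow pruned connections). The names sparsity_schedule and update_mask are hypothetical.

import torch

def sparsity_schedule(step, total_steps, start=0.90, end=0.75):
    # Hypothetical "sparsity variation": linearly relax sparsity from start to end.
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

@torch.no_grad()
def update_mask(weight, sparsity):
    # Magnitude-based mask: keep the largest (1 - sparsity) fraction of weights.
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = torch.topk(weight.abs().flatten(), k).values.min()
    return (weight.abs() >= threshold).to(weight.dtype)

# Sketch of use inside a pretraining loop:
layer = torch.nn.Linear(1024, 1024)
for step in range(0, 10_000, 1_000):          # e.g. refresh the mask every 1k steps
    mask = update_mask(layer.weight, sparsity_schedule(step, 10_000))
    with torch.no_grad():
        layer.weight.mul_(mask)               # masked weights are zero; sparse kernels
                                              # can skip them, which is where FLOPs are saved
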
Abstract
Large Language Models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, Transformer-based LLMs …