BriefGPT.xyz
Aug, 2024
Mixed Sparsity Training: Achieving 4× FLOP Reduction for Transformer Pretraining
Pihe Hu, Shaolong Li, Longbo Huang
TL;DR
This work targets the heavy computational demands of pretraining large language models and proposes a novel solution: Mixed Sparsity Training (MST). By combining dynamic sparse training with sparsity variation and a hybrid sparse attention mechanism, MST reduces floating-point operations (FLOPs) by up to 75% while maintaining model performance, yielding a substantial gain in computational efficiency.
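
To make the TL;DR concrete, the snippet below is a minimal PyTorch sketch, not the authors' implementation: it assumes a magnitude-based weight mask that is refreshed under a linearly relaxing sparsity schedule, which is one simple way to picture dynamic sparse training combined with sparsity variation (real DST methods also regrow pruned connections). The names sparsity_schedule and update_mask are hypothetical.

import torch

def sparsity_schedule(step, total_steps, start=0.90, end=0.75):
    # Hypothetical "sparsity variation": linearly relax sparsity from start to end.
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

@torch.no_grad()
def update_mask(weight, sparsity):
    # Magnitude-based mask: keep the largest (1 - sparsity) fraction of weights.
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = torch.topk(weight.abs().flatten(), k).values.min()
    return (weight.abs() >= threshold).to(weight.dtype)

# Sketch of use inside a pretraining loop:
layer = torch.nn.Linear(1024, 1024)
for step in range(0, 10_000, 1_000):          # e.g. refresh the mask every 1k steps
    mask = update_mask(layer.weight, sparsity_schedule(step, 10_000))
    with torch.no_grad():
        layer.weight.mul_(mask)               # masked weights are zero; sparse kernels
                                              # can skip them, which is where FLOPs are saved
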
Abstract
Large Language Models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, Transformer-based LLMs …