Jan, 2025
MSWA: Refining Local Attention with Multi-ScaleWindow Attention
Yixing Xu, Shivank Nag, Dong Li, Lu Tian, Emad Barsoum
TL;DR
To address the limitations of the standard self-attention mechanism in computational complexity and cache size, this work proposes Multi-Scale Window Attention (MSWA), which applies diverse window sizes across different heads and layers of the Transformer to better capture context at varying scales. Experimental results show that MSWA outperforms traditional local attention in both effectiveness and efficiency on language modeling and common-sense reasoning tasks.
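As a rough illustration of the idea (not the authors' implementation: the function name, the causal banded masking, and the specific window sizes below are assumptions made for this sketch), the snippet assigns each attention head its own local window, so some heads attend only to nearby tokens while others cover a wider context:

```python
# Minimal sketch of per-head multi-scale window attention.
# Window sizes are illustrative; the paper's actual allocation across
# heads and layers may differ.
import torch

def multi_scale_window_attention(q, k, v, window_sizes):
    """q, k, v: (batch, num_heads, seq_len, head_dim); one window size per head."""
    batch, num_heads, seq_len, head_dim = q.shape
    assert len(window_sizes) == num_heads

    # Raw attention scores: (batch, heads, seq, seq).
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5

    # Causal band mask per head: position i may attend to positions [i - w + 1, i].
    idx = torch.arange(seq_len)
    dist = idx.view(-1, 1) - idx.view(1, -1)              # i - j
    masks = torch.stack([(dist >= 0) & (dist < w) for w in window_sizes])  # (heads, seq, seq)

    scores = scores.masked_fill(~masks.unsqueeze(0), float('-inf'))
    attn = scores.softmax(dim=-1)
    return attn @ v                                        # (batch, heads, seq, head_dim)

# Example: 4 heads with increasing window sizes, so small-window heads capture
# fine-grained local detail while large-window heads capture broader context.
q = k = v = torch.randn(1, 4, 16, 8)
out = multi_scale_window_attention(q, k, v, window_sizes=[2, 4, 8, 16])
print(out.shape)  # torch.Size([1, 4, 16, 8])
```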
Abstract
Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size.