Jun, 2024
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models
Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm
TL;DR
Proposes an alternative compatibility function for the self-attention mechanism of the Transformer architecture and applies the resulting symmetric attention mechanism to the pre-training of BERT-like models, reaching a score of 79.36 on the GLUE benchmark, reducing the number of trainable parameters by 6%, and halving the number of training steps required before convergence.
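The brief does not spell out the symmetric compatibility function itself. As one possible reading, a minimal PyTorch sketch of a single attention head that shares the query/key projection, so that the score matrix QKᵀ becomes symmetric, might look like the following; the class and argument names are illustrative and not taken from the paper.

```python
import math
import torch
import torch.nn as nn

class SymmetricDotProductAttention(nn.Module):
    """Illustrative sketch only: one attention head whose query and key
    projections share a single weight matrix, making the dot-product
    score matrix symmetric. This is an assumed simplification of the
    paper's symmetric attention, not its exact formulation."""

    def __init__(self, hidden_dim: int, head_dim: int):
        super().__init__()
        # One shared projection replaces the separate W_Q and W_K,
        # dropping one of the dense matrices in the attention block.
        self.shared_qk = nn.Linear(hidden_dim, head_dim)
        self.value = nn.Linear(hidden_dim, head_dim)
        self.scale = 1.0 / math.sqrt(head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        qk = self.shared_qk(x)          # queries and keys coincide
        v = self.value(x)
        # Symmetric compatibility scores, scaled as in standard attention.
        scores = torch.matmul(qk, qk.transpose(-2, -1)) * self.scale
        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, v)
```

Under this reading, removing one of the projection matrices per attention block is roughly in line with the reported 6% reduction in trainable parameters for a BERT-sized model, though the exact saving depends on the model configuration.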
Abstract
Initially introduced as a machine translation model, the transformer architecture has now become the foundation for modern deep learning architectures, with applications in a wide range of fields, from computer vision …