May 2023
Mimetic Initialization of Self-Attention Layers
Asher Trockman, J. Zico Kolter
TL;DR
A mimetic initialization scheme that imitates the weights of pre-trained Transformers improves the final accuracy of vanilla Transformers on vision tasks and makes them train faster.
Abstract
It is notoriously difficult to train transformers on small datasets; typically, large pre-trained models are instead used as the starting point. We explore the weights of such pre-trained …
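The truncated abstract does not spell out the initialization recipe, but as a rough illustration of what "mimicking" pre-trained attention weights could mean in practice, the sketch below initializes each head's query/key projections so that their product W_q W_kᵀ resembles a scaled identity plus noise, a structure reported for pre-trained self-attention. The function name `mimetic_qk_init`, the `alpha`/`beta` parameters, and the SVD-based factorization are illustrative assumptions, not the paper's exact method.

```python
import torch

def mimetic_qk_init(d_model: int, n_heads: int, alpha: float = 0.7, beta: float = 0.7):
    """Illustrative sketch (not the paper's exact recipe): build per-head
    query/key projections whose product W_q @ W_k.T approximates
    alpha * Z + beta * I, i.e. random noise plus an identity-like component
    of the kind observed in pre-trained self-attention layers."""
    d_head = d_model // n_heads
    W_q = torch.empty(n_heads, d_model, d_head)
    W_k = torch.empty(n_heads, d_model, d_head)
    for h in range(n_heads):
        Z = torch.randn(d_model, d_model) / d_model ** 0.5      # random component
        target = alpha * Z + beta * torch.eye(d_model)          # desired W_q @ W_k.T
        U, S, Vh = torch.linalg.svd(target)
        # Keep the top-d_head singular directions and split the singular values
        # evenly between the two factors, so W_q[h] @ W_k[h].T is the best
        # rank-d_head approximation of `target`.
        W_q[h] = U[:, :d_head] * S[:d_head].sqrt()
        W_k[h] = Vh[:d_head].T * S[:d_head].sqrt()
    return W_q, W_k

# Example: projection tensors shaped for an 8-head, 256-dim ViT-style attention block.
W_q, W_k = mimetic_qk_init(d_model=256, n_heads=8)
```

The low-rank SVD factorization is only one way to split the target product between the two projections; the key idea it illustrates is that the initialization is chosen to reproduce a statistic of pre-trained weights rather than being purely random.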