May 2023
Mimetic Initialization of Self-Attention Layers
Asher Trockman, J. Zico Kolter
TL;DR
A mimetic initialization scheme that imitates the weights of pre-trained Transformers improves the final accuracy of vanilla Transformers on vision tasks and makes them train faster.
Abstract
It is notoriously difficult to train transformers on small datasets; typically, large pre-trained models are instead used as the starting point. We explore the weights of such pre-trained …
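The truncated abstract does not spell out the initialization recipe, but as a rough illustration of what "mimicking" pre-trained attention weights could mean in practice, the sketch below initializes each head's query/key projections so that their product W_q W_kᵀ resembles a scaled identity plus noise, a structure reported for pre-trained self-attention. The function name `mimetic_qk_init`, the `alpha`/`beta` parameters, and the SVD-based factorization are illustrative assumptions, not the paper's exact method.

```python
import torch

def mimetic_qk_init(d_model: int, n_heads: int, alpha: float = 0.7, beta: float = 0.7):
    """Illustrative sketch (not the paper's exact recipe): build per-head
    query/key projections whose product W_q @ W_k.T approximates
    alpha * Z + beta * I, i.e. random noise plus an identity-like component
    of the kind observed in pre-trained self-attention layers."""
    d_head = d_model // n_heads
    W_q = torch.empty(n_heads, d_model, d_head)
    W_k = torch.empty(n_heads, d_model, d_head)
    for h in range(n_heads):
        Z = torch.randn(d_model, d_model) / d_model ** 0.5      # random component
        target = alpha * Z + beta * torch.eye(d_model)          # desired W_q @ W_k.T
        U, S, Vh = torch.linalg.svd(target)
        # Keep the top-d_head singular directions and split the singular values
        # evenly between the two factors, so W_q[h] @ W_k[h].T is the best
        # rank-d_head approximation of `target`.
        W_q[h] = U[:, :d_head] * S[:d_head].sqrt()
        W_k[h] = Vh[:d_head].T * S[:d_head].sqrt()
    return W_q, W_k

# Example: projection tensors shaped for an 8-head, 256-dim ViT-style attention block.
W_q, W_k = mimetic_qk_init(d_model=256, n_heads=8)
```

The low-rank SVD factorization is only one way to split the target product between the two projections; the key idea it illustrates is that the initialization is chosen to reproduce a statistic of pre-trained weights rather than being purely random.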