We introduce Tangent Attention Fine-Tuning (TAFT), a method for fine-tuning linearized transformers obtained by computing a first-order Taylor expansion around a pre-trained initialization. We show that the Jacobian-vector product resulting from linearization can be computed efficiently in a single forward pass, reducing training and inference cost to the same order of magnitude as that of the original non-linear counterpart, while using the same number of parameters. Furthermore, we show that, when applied to various downstream visual classification tasks, the resulting Tangent Transformer fine-tuned with TAFT can perform comparably to fine-tuning the original non-linear network. Since Tangent Transformers are linear with respect to the new set of weights, and the resulting fine-tuning loss is convex, we show that TAFT enjoys several advantages over non-linear fine-tuning when it comes to model composition, parallel training, machine unlearning, and differential privacy.
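As a rough illustration of the linearization described above, the following JAX sketch builds a tangent model f(x; w0) + J(w0)(w - w0) using `jax.jvp`, which returns the network output and the Jacobian-vector product together without materializing the Jacobian. The two-layer network `f`, the helper `tangent_model`, and all shapes are hypothetical stand-ins chosen for brevity; this is not the transformer architecture or the single-forward-pass implementation from the paper.

```python
import jax
import jax.numpy as jnp


def f(params, x):
    """Toy non-linear network standing in for a pre-trained model (assumption)."""
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2


def tangent_model(params0, delta, x):
    """First-order Taylor expansion of f around params0, evaluated at params0 + delta."""
    # jax.jvp returns f(params0, x) and J(params0) @ delta in one call.
    f0, jvp_out = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
    return f0 + jvp_out


key1, key2, key3 = jax.random.split(jax.random.PRNGKey(0), 3)
params0 = (jax.random.normal(key1, (4, 8)), jax.random.normal(key2, (8, 3)))
x = jax.random.normal(key3, (5, 4))

# Three weight offsets: zero, and two arbitrary perturbations plus their sum.
zero = jax.tree_util.tree_map(jnp.zeros_like, params0)
d1 = jax.tree_util.tree_map(lambda p: 0.1 * jnp.ones_like(p), params0)
d2 = jax.tree_util.tree_map(lambda p: -0.05 * p, params0)
d_sum = jax.tree_util.tree_map(lambda a, b: a + b, d1, d2)

out0 = tangent_model(params0, zero, x)    # equals the pre-trained output
out1 = tangent_model(params0, d1, x)
out2 = tangent_model(params0, d2, x)
out_sum = tangent_model(params0, d_sum, x)

# Linearity in the offset: the change from d1 + d2 is the sum of the two changes.
print(jnp.allclose(out_sum - out0, (out1 - out0) + (out2 - out0), atol=1e-3))
```

Because the tangent model's output is affine in the offset `delta`, the effects of separately obtained offsets add up, which is the property the abstract relies on for model composition and parallel training.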

Tangent Transformers for Composition, Privacy and Removal