Despite the massive success of fine-tuning large Pre-trained Language Models
(PLMs) on a wide range of Natural Language Processing (NLP) tasks, fine-tuned
PLMs remain susceptible to out-of-distribution (OOD) and adversarial inputs.
Data map (DM) is a simple yet effective dual-model approach that enhances the
robustness of fine-tuned PLMs. It involves fine-tuning a model on the original
training set (the reference model), selecting a specified fraction of important
training examples according to the training dynamics of the reference model,
and fine-tuning the same model on these selected examples (the main model).
However, DM has the drawback of requiring the same model to be fine-tuned
twice, which is computationally expensive for large models. In this paper, we
first show that 1) training dynamics are highly transferable across different
model sizes and different pre-training methods, and that 2) main models
fine-tuned using DM learn faster than when using conventional Empirical Risk
Minimization (ERM). Building on these observations, we propose a novel
fine-tuning approach based on the DM method: Fine-Tuning by transFerring
Training dynamics (FTFT). Compared with DM, FTFT uses more efficient reference
models and then fine-tunes more capable main models for fewer steps. Our
experiments show that FTFT achieves better generalization robustness than ERM
while spending less than half of the training cost.
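
The selection step of DM can be illustrated with a minimal sketch. It assumes that the training dynamics of an example are summarized by the mean (confidence) and standard deviation (variability) of the probability that the reference model assigns to the gold label across epochs, and that the "important" examples are the most ambiguous ones, i.e., those with the highest variability. The abstract does not state the selection criterion, so this criterion, the function names, and the toy probabilities below are all assumptions for illustration only.

```python
from statistics import mean, pstdev


def training_dynamics(gold_probs):
    """Summarize per-example training dynamics.

    gold_probs[i][e] is the probability the reference model assigned to the
    gold label of example i after epoch e (hypothetical input format).
    Returns a list of (confidence, variability) pairs.
    """
    return [(mean(p), pstdev(p)) for p in gold_probs]


def select_ambiguous(gold_probs, fraction):
    """Return indices of the `fraction` of examples with highest variability.

    Assumption: "important" means "ambiguous" (high variability).
    """
    dynamics = training_dynamics(gold_probs)
    k = max(1, round(fraction * len(gold_probs)))
    ranked = sorted(
        range(len(gold_probs)),
        key=lambda i: dynamics[i][1],  # sort by variability
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy gold-label probabilities recorded over three epochs (made-up values).
gold_probs = [
    [0.90, 0.95, 0.99],  # consistently confident: easy to learn
    [0.20, 0.80, 0.30],  # fluctuating: ambiguous
    [0.10, 0.10, 0.15],  # consistently low: hard to learn
    [0.40, 0.90, 0.50],  # fluctuating: ambiguous
]

# Indices of the examples the main model would be fine-tuned on.
selected = select_ambiguous(gold_probs, fraction=0.5)
print(selected)
```

Under the FTFT idea described above, `gold_probs` would be recorded from a smaller, cheaper reference model, and the selected indices would then be used to fine-tune a larger main model for fewer steps.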

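Building on the data map method and on training dynamics, this paper proposes a new fine-tuning approach (FTFT) that achieves better generalization robustness than conventional Empirical Risk Minimization (ERM) at less than half the training cost.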

FTFT: efficient and robust Fine-Tuning by transFerring Training dynamics

Traditional normalization techniques (e.g., Batch Normalization and Instance
Normalization) generally rest on the simplifying assumption that training and
test data follow the same distribution. As distribution shifts are inevitable
in real-world applications, even well-trained models that use these
normalization methods can perform poorly in new environments. Can we develop new normalization
methods to improve generalization robustness under distribution shifts? In this
paper, we answer the question by proposing CrossNorm and SelfNorm. CrossNorm
exchanges channel-wise mean and variance between feature maps to enlarge the
training distribution, while SelfNorm uses attention to recalibrate the
statistics to bridge gaps between training and test distributions. CrossNorm
and SelfNorm complement each other, although they use feature statistics in
different ways. Extensive experiments across different fields (vision and
language), tasks (classification and segmentation), settings (supervised and
semi-supervised), and distribution shift types (synthetic and natural) show
their effectiveness. Code is available at
this https URL
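
The statistics exchange that CrossNorm performs can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each feature map is given as a list of channels with the spatial positions flattened, that each channel is standardized with its own mean and standard deviation and then rescaled with the other map's statistics, and that a small constant `EPS` (a hypothetical choice) guards against division by zero.

```python
from statistics import mean, pstdev

EPS = 1e-5  # hypothetical stabilizer, not taken from the paper


def channel_stats(feature_map):
    """Return (mean, std) for each channel of a feature map.

    feature_map[c] is the flattened list of activations in channel c.
    """
    return [(mean(ch), pstdev(ch)) for ch in feature_map]


def restyle(feature_map, own_stats, new_stats):
    """Standardize each channel with its own stats, then apply new stats."""
    out = []
    for ch, (mu, sigma), (new_mu, new_sigma) in zip(
        feature_map, own_stats, new_stats
    ):
        out.append([new_sigma * (x - mu) / (sigma + EPS) + new_mu for x in ch])
    return out


def cross_norm(a, b):
    """Exchange channel-wise mean and std between feature maps a and b."""
    stats_a, stats_b = channel_stats(a), channel_stats(b)
    return restyle(a, stats_a, stats_b), restyle(b, stats_b, stats_a)


# Two toy feature maps with 2 channels and 4 spatial positions each.
a = [[0.0, 2.0, 4.0, 6.0], [1.0, 1.5, 2.0, 2.5]]
b = [[10.0, 10.0, 14.0, 14.0], [-3.0, -1.0, 1.0, 3.0]]

a_swapped, b_swapped = cross_norm(a, b)
print(channel_stats(a_swapped))  # approximately the channel stats of b
print(channel_stats(b_swapped))  # approximately the channel stats of a
```

After the exchange, each map keeps its own spatial pattern but carries the other map's channel statistics, which is the sense in which the training distribution is enlarged.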

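This paper introduces two new normalization techniques, CrossNorm and SelfNorm, which improve generalization robustness under distribution shifts by exchanging channel-wise mean and variance between feature maps and by using attention to recalibrate the statistics. They are shown to be effective across different fields (vision and language), tasks (classification and segmentation), settings (supervised and semi-supervised), and distribution shift types (synthetic and natural).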