The pretraining-fine-tuning paradigm has been the de facto strategy for
transfer learning in modern language modeling. Because task adaptation in
language models (LMs) is often a function of parameters shared across tasks, we
argue that smoother transfer learning calls for a more surgical approach to
regularization. To this end, we investigate how the
pretraining loss landscape is affected by these task-sensitive parameters
through an information-theoretic lens. We then leverage the findings from our
investigations to devise a novel approach to dropout for improved model
regularization and better downstream generalization. This approach, named
guided dropout, is both task- and architecture-agnostic and adds no
computational overhead to the fine-tuning process. Through empirical
evaluations, we show that our approach to regularization yields consistently
better performance than standard baselines, even when data is scarce.

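The conventional pretraining-fine-tuning strategy has become the standard approach to transfer learning in modern language modeling, but smoother transfer calls for more targeted regularization of task-sensitive parameters. This paper studies, from an information-theoretic perspective, how task-sensitive parameters affect the pretraining loss landscape, and uses the findings to propose guided dropout, a new dropout method for improving model regularization and downstream generalization. Empirical evaluation shows that the method consistently outperforms standard baselines, including when data is scarce.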
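
To make the idea of information-guided dropout concrete, here is a minimal sketch. The abstract does not specify the estimator or the schedule, so everything below is an assumption for illustration: layer information is approximated by the mean squared gradient (a diagonal empirical-Fisher proxy), and the hypothetical helper `guided_dropout_rates` assigns lower dropout probabilities to higher-scoring layers by rank. The names `layer_information_scores`, `guided_dropout_rates`, `apply_dropout`, `p_min`, and `p_max` are not from the paper.

```python
import numpy as np


def layer_information_scores(grads_per_layer):
    """Score each layer by its mean squared gradient.

    This is a diagonal empirical-Fisher proxy chosen for illustration;
    it is an assumption, not the estimator used in the paper.
    """
    return np.array([np.mean(np.square(g)) for g in grads_per_layer])


def guided_dropout_rates(scores, p_min=0.05, p_max=0.3):
    """Map layer scores to dropout probabilities by rank.

    The highest-scoring layer receives p_min and the lowest-scoring layer
    receives p_max, with linear spacing in between (hypothetical schedule).
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    if n == 1:
        return np.array([p_min])
    order = np.argsort(-scores)  # layer indices, most to least informative
    rates = np.empty(n)
    rates[order] = np.linspace(p_min, p_max, n)
    return rates


def apply_dropout(x, rate, rng):
    """Standard inverted dropout: zero units with probability `rate`, rescale the rest."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

In this sketch the rates would be computed once before fine-tuning, so the training loop itself runs ordinary dropout with fixed per-layer probabilities.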

Information Guided Regularization for Fine-tuning Language Models

Successful applications that process sequential data, such as text and speech,
require improved generalization performance from recurrent neural networks
(RNNs). Dropout techniques for RNNs were introduced to meet this demand, but we
conjecture that dropout on RNNs can be improved further by adopting an
adversarial concept. This paper investigates ways to improve dropout for RNNs
by using intentionally generated dropout masks.
Specifically, the guided dropout used in this research is called adversarial
dropout, which adversarially disconnects the neurons that are dominantly used to
predict correct targets over time. Our analysis shows that our regularizer,
which consists of the gap between the original and the reconfigured RNNs, is an
upper bound on the gap between the training and inference phases of
random dropout. We demonstrate that minimizing our regularizer improves the
effectiveness of dropout for RNNs on sequential MNIST tasks,
semi-supervised text classification tasks, and language modeling tasks.

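By improving dropout for recurrent neural networks with adversarially generated dropout masks, the work increases the effectiveness of dropout for RNNs on sequential MNIST, semi-supervised text classification, and language modeling tasks.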
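
The two ingredients named in the abstract, an adversarially chosen mask and a regularizer measuring the gap between the original and the reconfigured network, can be illustrated with a small sketch. It is a simplification under stated assumptions: a single time step with a linear readout instead of a full RNN, a greedy top-`budget` selection instead of the paper's mask search, and KL divergence as the gap measure. The names `adversarial_mask`, `gap_regularizer`, and `budget` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def adversarial_mask(h, W, target, budget):
    """Build a dropout mask that disconnects the hidden units contributing
    most to the correct-class logit.

    h: hidden activations, shape (d,)
    W: readout weights, shape (d, num_classes)
    budget: number of units to drop (assumed hyperparameter)
    """
    contrib = h * W[:, target]            # per-unit contribution to the target logit
    mask = np.ones_like(h)
    drop = np.argsort(-contrib)[:budget]  # largest contributors first
    mask[drop] = 0.0
    return mask


def gap_regularizer(h, W, mask, eps=1e-12):
    """KL divergence between predictions of the original and the masked network
    (one possible choice of gap measure, assumed here)."""
    p = softmax(h @ W)
    q = softmax((h * mask) @ W)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

During training, a term like `gap_regularizer` would be added to the task loss so that predictions stay stable when the most relied-upon units are removed; in an RNN the same mask would typically be applied across time steps.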