Several pre-training objectives, such as masked language modeling (MLM), have
been proposed to pre-train language models (e.g. BERT) with the aim of learning
better language representations. However, to the best of our knowledge, no
previous work has investigated how different pre-training objectives
affect what BERT learns about linguistic properties. We hypothesize that
linguistically motivated objectives such as MLM should help BERT acquire
better linguistic knowledge than non-linguistically motivated
objectives, which are not intuitive or for which humans find it hard to guess
the association between the input and the label to be predicted. To this end,
we pre-train BERT with two linguistically motivated objectives and three
non-linguistically motivated ones. We then probe for linguistic characteristics
encoded in the representations of the resulting models. We find strong evidence that there are
only small differences in probing performance between the representations
learned by the two different types of objectives. These surprising results
question the dominant narrative of linguistically informed pre-training.
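
As a rough illustration of what probing a representation can look like, the sketch below fits a linear (logistic-regression) probe on frozen feature vectors and reports held-out accuracy. It is not the setup used in the paper: the features are synthetic stand-ins for encoder outputs, and the names `train_linear_probe`, `probe_accuracy`, and `make_toy_representations` are hypothetical helpers introduced here only for illustration.

```python
import math
import random


def _sigmoid(z):
    # Clamp the logit so math.exp never overflows.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with full-batch gradient descent.

    The feature vectors are treated as frozen: only the probe's weights
    and bias are updated.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        correct += int((z > 0) == bool(y))
    return correct / len(labels)


def make_toy_representations(n, dim, signal, rng):
    """Synthetic stand-in for frozen encoder outputs (an assumption).

    The binary label shifts dimension 0 by +signal or -signal, so `signal`
    controls how linearly recoverable the property is.
    """
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal
        features.append(x)
        labels.append(y)
    return features, labels


if __name__ == "__main__":
    rng = random.Random(0)
    for signal in (3.0, 0.0):
        train_x, train_y = make_toy_representations(200, 4, signal, rng)
        test_x, test_y = make_toy_representations(200, 4, signal, rng)
        w, b = train_linear_probe(train_x, train_y)
        print(signal, probe_accuracy(w, b, test_x, test_y))
```

In a comparison like the one described above, the same probe would be trained on representations from each pre-trained model, and the held-out accuracies compared across objectives.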

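This paper examines how the pre-training objective affects what BERT learns about linguistic properties. BERT is pre-trained with two linguistically motivated objectives and three non-linguistically motivated ones, and probing shows only small differences between the representations learned by the two types of objectives, which calls into question the dominant narrative of linguistically informed pre-training.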

How does the pre-training objective affect what large language models learn about linguistic properties?

Several studies have investigated the reasons behind the effectiveness of
fine-tuning, usually through the lens of probing. However, these studies often
neglect the role of the size of the dataset on which the model is fine-tuned.
In this paper, we highlight the importance of this factor and its undeniable
role in probing performance. We show that the extent of encoded linguistic
knowledge depends on the number of fine-tuning samples. The analysis also
reveals that a larger training set mainly affects the higher layers, and that the
extent of this change depends on the number of iterations that update the
model during fine-tuning rather than on the diversity of the training samples.
Finally, we show through a set of experiments that fine-tuning data size
affects the recoverability of the changes made to the model's linguistic
knowledge.
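
The distinction between update count and sample diversity can be made concrete with a small calculation. Under a fixed epoch budget, a larger fine-tuning set also means more gradient updates, so the two factors are confounded unless the smaller set is trained for more epochs. The sketch below shows that arithmetic; the batch size, epoch count, and dataset sizes are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def num_updates(num_samples, batch_size, epochs):
    """Number of optimizer steps when every epoch covers the full dataset."""
    return epochs * math.ceil(num_samples / batch_size)


def epochs_to_match(target_updates, num_samples, batch_size):
    """Smallest number of epochs on a dataset that reaches target_updates steps."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(target_updates / steps_per_epoch)


if __name__ == "__main__":
    batch_size, epochs = 32, 3           # hypothetical training settings
    small, large = 1000, 8000            # hypothetical dataset sizes

    print(num_updates(small, batch_size, epochs))   # 96
    print(num_updates(large, batch_size, epochs))   # 750

    # Epochs the small set needs to receive as many updates as the large one.
    print(epochs_to_match(num_updates(large, batch_size, epochs), small, batch_size))  # 24
```

Matching the update count in this way keeps the number of iterations roughly equal while the diversity of the samples still differs, which is one possible way to separate the two effects.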

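This study shows that the size of the fine-tuning dataset plays an important role in probing results: the extent of encoded linguistic knowledge depends on the number of fine-tuning samples, a larger dataset mainly affects the higher layers of the network, and the size of this effect is tied to the number of fine-tuning iterations rather than to the diversity of the samples.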

On the Importance of Data Size in Probing Fine-tuned Models

Models of language trained on very large corpora have been demonstrated
to be useful for NLP. As fixed artifacts, they have become the object of intense
study, with many researchers "probing" the extent to which they acquire and
readily demonstrate linguistic abstractions, factual and commonsense knowledge,
and reasoning abilities. Building on this line of work, we consider a
new question: for types of knowledge a language model learns, when during
(pre)training are they acquired? We plot probing performance across iterations,
using RoBERTa as a case study. Among our findings: linguistic knowledge is
acquired fast, stably, and robustly across domains. Facts and commonsense are
slower and more domain-sensitive. Reasoning abilities are, in general, not
stably acquired. As new datasets, pretraining protocols, and probes emerge, we
believe that probing-across-time analyses can help researchers understand the
complex, intermingled learning that these models undergo and guide us toward
more efficient approaches that accomplish necessary learning faster.

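This work examines when, during (pre)training, a language model acquires different types of knowledge. Linguistic knowledge is acquired quickly, stably, and robustly across domains; facts and commonsense are slower and more domain-sensitive; reasoning abilities are, in general, not stably acquired. The authors suggest that probing-across-time analyses can guide researchers toward more efficient approaches that accomplish necessary learning faster.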