A line of work on Transformer-based language models such as BERT has
attempted to use syntactic inductive bias to enhance the pretraining process,
on the theory that building syntactic structure into the training process
should reduce the amount of data needed for training. But such methods are
often tested on high-resource languages such as English. In this work, we
investigate whether these methods can compensate for data sparseness in
low-resource languages, hypothesizing that they ought to be more effective
there. We experiment with five low-resource languages: Uyghur,
Wolof, Maltese, Coptic, and Ancient Greek. We find that these syntactic
inductive bias methods produce uneven results in low-resource settings, and
provide surprisingly little benefit in most cases.

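One line of research on Transformer-based language models such as BERT tries to use syntactic inductive bias to enhance pretraining, on the theory that building syntactic structure into the training process can reduce the amount of data training requires. Such methods, however, are usually tested on high-resource languages such as English. This study investigates whether these methods can compensate for data sparseness in low-resource languages, hypothesizing that they should be more effective there. Experiments cover five low-resource languages: Uyghur, Wolof, Maltese, Coptic, and Ancient Greek. The syntactic inductive bias methods produce uneven results in low-resource settings and, in most cases, provide surprisingly little benefit.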

Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

The emergent cross-lingual transfer seen in multilingual pretrained models
has sparked significant interest in studying their behavior. However, because
these analyses have focused on fully trained multilingual models, little is
known about the dynamics of the multilingual pretraining process. We
investigate when these models acquire their in-language and cross-lingual
abilities by probing checkpoints taken from throughout XLM-R pretraining, using
a suite of linguistic tasks. Our analysis shows that the model achieves high
in-language performance early on, with lower-level linguistic skills acquired
before more complex ones. In contrast, the point in pretraining when the model
learns to transfer cross-lingually differs across language pairs.
Interestingly, we also observe that, across many languages and tasks, the final
model layer exhibits significant performance degradation over time, while
linguistic knowledge propagates to lower layers of the network. Taken together,
these insights highlight the complexity of multilingual pretraining and the
resulting varied behavior for different languages over time.

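This study examines the learning process of multilingual pretrained models. It finds that the model reaches high in-language performance early, with lower-level linguistic skills acquired before more complex ones. The point at which cross-lingual transfer is learned differs across language pairs, and the final model layer's performance degrades over time while linguistic knowledge propagates to lower layers of the network.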
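
The abstract above describes probing checkpoints taken throughout pretraining with a suite of linguistic tasks. As a rough illustration of what such a probing loop might look like, the sketch below fits a simple linear probe on frozen representations for each (checkpoint step, layer) pair and reports held-out accuracy. It is a minimal sketch under stated assumptions, not the authors' setup: the function names `fit_linear_probe`, `probe_accuracy`, and `probe_grid` are hypothetical, the probe is a ridge-regularised least-squares classifier chosen for brevity, and the features in the demo are synthetic random vectors standing in for hidden states that would in practice be extracted from saved XLM-R checkpoints.

```python
import numpy as np


def fit_linear_probe(train_x, train_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularised least-squares linear probe on frozen features."""
    # Append a constant column so the probe can learn a bias term.
    x = np.hstack([train_x, np.ones((train_x.shape[0], 1))])
    targets = np.eye(num_classes)[train_y]  # one-hot label matrix
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)  # shape: (dim + 1, num_classes)


def probe_accuracy(weights, test_x, test_y):
    """Accuracy of the probe's argmax prediction on held-out examples."""
    x = np.hstack([test_x, np.ones((test_x.shape[0], 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == test_y))


def probe_grid(features, labels, num_classes, train_fraction=0.8):
    """Probe every (checkpoint_step, layer) entry in `features`.

    `features` maps (checkpoint_step, layer) -> array of shape (n_examples, dim);
    `labels` holds one integer class per example (e.g. a part-of-speech tag id).
    """
    n_train = int(train_fraction * len(labels))
    results = {}
    for key, reps in features.items():
        weights = fit_linear_probe(reps[:n_train], labels[:n_train], num_classes)
        results[key] = probe_accuracy(weights, reps[n_train:], labels[n_train:])
    return results


if __name__ == "__main__":
    # Synthetic stand-in data (an assumption for illustration only): real use
    # would replace these arrays with hidden states from saved checkpoints.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=300)
    class_means = rng.normal(size=(3, 16))
    features = {
        (step, layer): class_means[labels] + rng.normal(size=(300, 16))
        for step in (1000, 50000)
        for layer in (0, 12)
    }
    for key, accuracy in sorted(probe_grid(features, labels, 3).items()):
        print(key, round(accuracy, 3))
```

Keeping the representations frozen and the probe linear means that accuracy differences across the grid reflect what is linearly recoverable at each checkpoint and layer rather than the capacity of the probe itself. The numbers printed by the demo come from random data and say nothing about the reported findings.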