The use of neural language models to model human behavior has met with mixed
success. While some work has found that the surprisal estimates from these
models can be used to predict a wide range of human neural and behavioral
responses, other work studying more complex syntactic phenomena has found that
these surprisal estimates generate incorrect behavioral predictions. This paper
explores the extent to which the misalignment between empirical and
model-predicted behavior can be minimized by training models on more
developmentally plausible data, such as in the BabyLM Challenge. We trained
teacher language models on the BabyLM "strict-small" dataset and used
sentence-level surprisal estimates from these teacher models to create a
curriculum. We
found tentative evidence that our curriculum made it easier for models to
acquire linguistic knowledge from the training data: on the subset of tasks in
the BabyLM Challenge suite evaluating models' grammatical knowledge of English,
models first trained on the BabyLM data curriculum and then on a few randomly
ordered training epochs performed slightly better than models trained on
randomly ordered epochs alone. This improved linguistic knowledge acquisition
did not result in better alignment with human reading behavior, however: models
trained on the BabyLM dataset (with or without a curriculum) generated
predictions that were as misaligned with human behavior as models trained on
larger, less curated datasets. This suggests that training on developmentally
plausible datasets alone is likely insufficient to generate language models
capable of accurately predicting human language processing.
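
As a rough illustration of the surprisal-based curriculum described above, the sketch below scores each sentence with a teacher model and sorts the training sentences by that score. It is a minimal sketch under stated assumptions, not the authors' implementation: the add-one smoothed unigram model in `train_unigram_teacher` is a toy stand-in for the neural teacher models, and the function names, the summing of token surprisals, and the low-to-high ordering are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram_teacher(sentences):
    """Toy stand-in for a teacher LM (assumption): add-one smoothed unigram model."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def prob(token):
        return (counts[token] + 1) / (total + vocab_size)

    return prob


def sentence_surprisal(sentence, prob):
    """Sentence-level surprisal in bits: the sum of -log2 p(token) over tokens."""
    return sum(-math.log2(prob(tok)) for tok in sentence.split())


def build_curriculum(sentences, prob):
    """Order sentences from lowest to highest surprisal (assumed direction)."""
    return sorted(sentences, key=lambda s: sentence_surprisal(s, prob))


# Hypothetical three-sentence corpus, used only to exercise the functions.
corpus = ["the dog runs", "the dog", "a quantum cat sleeps soundly"]
teacher = train_unigram_teacher(corpus)
curriculum = build_curriculum(corpus, teacher)
```

With a neural teacher, `prob` would be replaced by the model's conditional token probabilities given the preceding context; the sorting step would stay the same.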

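The use of neural language models to model human behavior has produced mixed results. This paper explores whether training on more developmentally plausible data, as in the BabyLM Challenge, can reduce the mismatch between empirical and model-predicted behavior. Using teacher models trained on the BabyLM dataset to build a curriculum, the study finds that although the curriculum made it easier for models to acquire linguistic knowledge from the training data, it did not bring model predictions into closer alignment with human reading behavior. This suggests that training on developmentally plausible datasets alone is likely insufficient for accurately predicting human language processing.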

Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior?

Recent psycholinguistic studies have drawn conflicting conclusions about the
relationship between the quality of a language model and the ability of its
surprisal estimates to predict human reading times, a discrepancy that has been
speculated to stem from the large gap in both the amount of training data and
model capacity across studies. The current work aims to consolidate these findings by
evaluating surprisal estimates from Transformer-based language model variants
that vary systematically in the amount of training data and model capacity on
their ability to predict human reading times. The results show that surprisal
estimates from most variants with contemporary model capacities provide the
best fit after seeing about two billion training tokens, after which they begin
to diverge from humanlike expectations. Additionally, newly trained smaller
model variants reveal a 'tipping point' at convergence, after which the
decrease in language model perplexity begins to result in poorer fits to human
reading times. These results suggest that the massive amount of training data
is mainly responsible for the poorer fit achieved by surprisal from larger
pre-trained language models, and that a certain degree of model capacity is
necessary for Transformer-based language models to capture humanlike
expectations.
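
The abstract above relies on two linked quantities, token surprisal and language model perplexity. The sketch below shows the standard relationship between them: perplexity is two raised to the mean per-token surprisal in bits. It is a minimal sketch under stated assumptions: the function names are hypothetical, and the probabilities in `example_probs` are made-up illustrative values, not outputs of any model from the study.

```python
import math


def token_surprisals(probs):
    """Surprisal in bits for each token: -log2 of its conditional probability."""
    return [-math.log2(p) for p in probs]


def perplexity(probs):
    """Perplexity as 2 raised to the mean per-token surprisal (in bits)."""
    surprisals = token_surprisals(probs)
    return 2 ** (sum(surprisals) / len(surprisals))


# Made-up conditional probabilities for a three-token sequence (illustration only).
example_probs = [0.5, 0.25, 0.125]
surprisals = token_surprisals(example_probs)
ppl = perplexity(example_probs)
```

Because the two quantities are tied together in this way, a model with lower perplexity assigns lower surprisal on average. The finding reported above is that, past a certain point, those lower surprisal values fit human reading times less well.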

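This paper examines how the amount of training data and model capacity affect the ability of surprisal estimates from Transformer-based language models to predict human reading times. Surprisal estimates from most variants with contemporary model capacities provide the best fit after about two billion training tokens. The poorer fit of larger pre-trained language models is attributed mainly to their massive amount of training data, and a certain degree of model capacity is necessary for Transformer-based language models to capture humanlike expectations.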

Transformer-Based LM Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens

Transformer-based language models have shown strong performance on an array
of natural language understanding tasks. However, the question of how these
models react to implicit meaning has been largely unexplored. We investigate
this using the complement coercion phenomenon, which involves sentences like
"The student finished the book about sailing" where the action "reading" is
implicit. We compare LMs' surprisal estimates at various critical sentence
regions in sentences with and without implicit meaning. Effects associated with
recovering implicit meaning were found at a critical region other than where
sentences minimally differ. We then use follow-up experiments to factor out
potential confounds, revealing different perspectives that offer a richer and
more accurate picture.

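This paper studies how Transformer-based language models respond to implicit meaning by comparing surprisal estimates at critical sentence regions in sentences with and without implicit meaning. Effects associated with recovering implicit meaning appear at a critical region other than the one where the sentences minimally differ. Follow-up experiments factor out potential confounds and offer a richer and more accurate picture.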