Large Language Models (LLMs) have been shown to be effective models of the
human language system, with some models predicting most explainable variance of
brain activity in current datasets. Even in untrained models, the
representations induced by architectural priors can exhibit reasonable
alignment to brain data. In this work, we investigate the key architectural
components driving the surprising alignment of untrained models. To estimate
LLM-to-brain similarity, we first select language-selective units within an
LLM, similar to how neuroscientists identify the language network in the human
brain. We then benchmark the brain alignment of these LLM units across five
different brain recording datasets. By isolating critical components of the
Transformer architecture, we identify tokenization strategy and multihead
attention as the two major components driving brain alignment. A simple form of
recurrence further improves alignment. We also demonstrate this quantitative
brain alignment of our model by reproducing landmark studies in the language
neuroscience field, showing that localized model units -- just like language
voxels measured empirically in the human brain -- discriminate more reliably
between lexical than syntactic differences, and exhibit similar response
profiles under the same experimental conditions. Finally, we demonstrate the
utility of our model's representations for language modeling, achieving
improved sample and parameter efficiency over comparable architectures. Our
model's estimates of surprisal set a new state of the art in behavioral
alignment to human reading times. Taken together, we propose a highly brain-
and behaviorally-aligned model that conceptualizes the human language system as
an untrained shallow feature encoder, with structural priors, combined with a
trained decoder to achieve efficient and performant language processing.
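
The reading-time comparison above rests on surprisal, the negative log-probability a model assigns to each word given its context. A minimal sketch of that quantity follows; the function name and the probability values are hypothetical and are not taken from the paper.

```python
import math


def surprisal_bits(probabilities):
    """Return the surprisal, -log2(p), in bits for each probability."""
    return [-math.log2(p) for p in probabilities]


# Hypothetical next-word probabilities from some language model (assumed values).
word_probs = [0.5, 0.25, 0.125]
surprisals = surprisal_bits(word_probs)

for p, s in zip(word_probs, surprisals):
    print(f"p = {p:.3f} -> surprisal = {s:.1f} bits")
```

Less probable words receive higher surprisal, and it is these per-word values that would be compared against human reading times.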

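By studying large language models, this paper reveals similarities between language models and the human brain, focusing its analysis on the key factors of tokenization strategy and multihead attention among the architectural components, as well as demand certainty, and ultimately proposes a model that is highly aligned with the human brain and behavior.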

Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network

In this paper, we introduce refined Direct Preference Optimization
(rDPO), a method for improving the behavioral alignment of Large Language
Models (LLMs) without the need for human-annotated data. The method involves
creating synthetic data using self-critique prompting by a teacher LLM and then
utilising a generalized DPO loss function to distil to a student LLM. The loss
function incorporates an additional external reward model to improve the
quality of synthetic data, making rDPO robust to potential noise in the
synthetic dataset. rDPO is shown to be effective in a diverse set of
behavioral alignment tasks, such as improved safety, robustness against
role-playing, and reduced sycophancy. Code to be released at
this https URL
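
The abstract does not spell out the generalized loss or how the external reward model enters it. As a reference point only, the sketch below computes the standard DPO loss for a single preference pair, which is the objective rDPO is described as generalizing. The function name, argument names, and the default `beta` are hypothetical.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. This is plain DPO, not rDPO.
    """
    # Scaled difference of the policy-vs-reference log-ratios.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed log-probabilities: the policy prefers the chosen response more
# strongly than the reference model does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(loss)
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2. It decreases as the policy shifts probability toward the chosen response relative to the reference.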

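Proposes a method called "rDPO" that creates synthetic data through self-critique prompting and uses a generalized DPO loss function to distill it into a student LLM, with an additional external reward model improving the quality of the synthetic data, thereby improving the behavioral alignment of large language models.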