Large Language Models (LLMs) have been shown to be effective models of the
human language system, with some models predicting most explainable variance of
brain activity in current datasets. Even in untrained models, the
representations induced by architectural priors can exhibit reasonable
alignment to brain data. In this work, we investigate the key architectural
components driving the surprising alignment of untrained models. To estimate
LLM-to-brain similarity, we first select language-selective units within an
LLM, similar to how neuroscientists identify the language network in the human
brain. We then benchmark the brain alignment of these LLM units across five
different brain recording datasets. By isolating critical components of the
Transformer architecture, we identify tokenization strategy and multihead
attention as the two major components driving brain alignment. A simple form of
recurrence further improves alignment. We also demonstrate the quantitative
brain alignment of our model by reproducing landmark studies in language
neuroscience, showing that localized model units -- just like language
voxels measured empirically in the human brain -- discriminate more reliably
between lexical than syntactic differences, and exhibit similar response
profiles under the same experimental conditions. Finally, we demonstrate the
utility of our model's representations for language modeling, achieving
improved sample and parameter efficiency over comparable architectures. Our
model's estimates of surprisal set a new state of the art in behavioral
alignment to human reading times. Taken together, we propose a highly brain-
and behaviorally-aligned model that conceptualizes the human language system as
an untrained shallow feature encoder, with structural priors, combined with a
trained decoder to achieve efficient and performant language processing.
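
The unit-selection step described above can be illustrated with a minimal sketch. It assumes that language selectivity is scored with a Welch t-statistic contrasting each unit's activations on sentences against its activations on non-word strings, and that the top k units are kept. The abstract does not specify the contrast, the statistic, or k, so all three, along with the function name `select_language_units` and the synthetic activations, are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

def select_language_units(sent_acts, nonword_acts, k):
    """Rank units by a Welch t-statistic (sentences > non-words); keep the top k.

    sent_acts:    (n_sentences, n_units) activations (hypothetical input)
    nonword_acts: (n_nonwords, n_units) activations (hypothetical input)
    Returns (indices of the k most selective units, t-statistic per unit).
    """
    mean_diff = sent_acts.mean(axis=0) - nonword_acts.mean(axis=0)
    # Squared standard error of each condition mean, per unit.
    se_sent = sent_acts.var(axis=0, ddof=1) / sent_acts.shape[0]
    se_nonword = nonword_acts.var(axis=0, ddof=1) / nonword_acts.shape[0]
    t = mean_diff / np.sqrt(se_sent + se_nonword + 1e-12)
    top = np.argsort(t)[::-1][:k]  # largest t first
    return top, t

# Synthetic demo: 20 units, of which units 3 and 7 respond more to sentences.
rng = np.random.default_rng(0)
sent = rng.normal(size=(50, 20))
sent[:, [3, 7]] += 5.0
nonword = rng.normal(size=(50, 20))

top_units, t_values = select_language_units(sent, nonword, k=2)
print(sorted(top_units.tolist()))
```

In a real pipeline the two activation matrices would come from running the model on the localizer stimuli; here they are random numbers chosen only to make the selection visible.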

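By studying large language models, this paper examines the similarity between language models and the human brain, identifies tokenization strategy and multihead attention as the key architectural components driving that similarity, and proposes a model that is highly aligned with both human brain activity and behavior.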

Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network

Simultaneous machine translation models start generating a target sequence
before they have read or encoded the full source sequence. Recent approaches for
this task either apply a fixed policy on a state-of-the-art Transformer model,
or a learnable monotonic attention on a weaker recurrent neural network-based
structure. In this paper, we propose a new attention mechanism, Monotonic
Multihead Attention (MMA), which extends the monotonic attention mechanism to
multihead attention. We also introduce two novel and interpretable approaches
for latency control that are specifically designed for multiple attention
heads. We apply MMA to the simultaneous machine translation task and
demonstrate better latency-quality tradeoffs compared to MILk, the previous
state-of-the-art approach. We also analyze how the latency controls affect the
attention span and we motivate the introduction of our model by analyzing the
effect of the number of decoder layers and heads on quality and latency.
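
As a rough illustration of how monotonic attention extends to several heads, the sketch below shows one greedy decoding step. It assumes the common hard-monotonic inference rule: each head resumes from the source position it attended to at the previous target step and moves right until its selection probability reaches a threshold. The function name `monotonic_multihead_step`, the 0.5 threshold, the choice to stop at the last source token, and the example probabilities are all assumptions for illustration; the abstract does not describe the inference procedure or the latency controls in this detail.

```python
import numpy as np

def monotonic_multihead_step(p_choose, prev_pos, threshold=0.5):
    """One greedy target step of hard monotonic attention with several heads.

    p_choose: (n_heads, src_len) selection probabilities for this target step
    prev_pos: (n_heads,) source index each head attended to at the previous step
    Returns (new position per head, number of source tokens that must be read).
    """
    n_heads, src_len = p_choose.shape
    new_pos = np.empty(n_heads, dtype=int)
    for h in range(n_heads):
        j = int(prev_pos[h])
        # Move right only (monotonicity) until this head decides to attend.
        # Simplification: a head that never selects stops at the last token.
        while j < src_len - 1 and p_choose[h, j] < threshold:
            j += 1
        new_pos[h] = j
    # The step can only be emitted once the furthest head's token is available.
    tokens_needed = int(new_pos.max()) + 1
    return new_pos, tokens_needed

p_step = np.array([[0.1, 0.9, 0.2, 0.3],
                   [0.2, 0.3, 0.1, 0.8]])
pos, needed = monotonic_multihead_step(p_step, prev_pos=np.array([0, 0]))
print(pos.tolist(), needed)
```

In this toy step the first head stops early while the second reads to the end of the source, so the slowest head determines how much source must be read before writing. This is the kind of per-head spread that a multihead-specific latency control would have to manage.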

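This paper proposes a new attention mechanism, Monotonic Multihead Attention (MMA), for simultaneous machine translation, and introduces two novel, interpretable latency-control methods specific to multihead attention. Compared with MILk, the previous state-of-the-art approach, MMA achieves better latency-quality tradeoffs. The paper also analyzes how the latency controls affect the attention span, and motivates the model by analyzing the effect of the number of decoder layers and heads on quality and latency.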

Monotonic Multihead Attention

The Pointer-Generator architecture has been shown to be a big improvement for
abstractive summarization seq2seq models. However, the summaries produced by
this model are largely extractive as over 30% of the generated sentences are
copied from the source text. This work proposes a multihead attention
mechanism, pointer dropout, and two new loss functions to promote more
abstractive summaries while maintaining similar ROUGE scores. Neither the
multihead attention nor the dropout improves N-gram novelty; however, the
dropout acts as a regularizer that improves the ROUGE score. The new loss
function yields significantly more novel N-grams and sentences, at the cost
of a slightly lower ROUGE score.
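
N-gram novelty, the abstractiveness measure mentioned above, can be sketched as the fraction of summary n-grams that never occur in the source. The version below assumes whitespace tokenization and counts repeated n-grams individually. The abstract does not give the exact definition used, so this is one plausible reading, and the helper names and example sentences are illustrative.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_rate(source, summary, n):
    """Fraction of the summary's n-grams that do not appear in the source."""
    source_ngrams = set(ngrams(source.split(), n))
    summary_ngrams = ngrams(summary.split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    novel = sum(g not in source_ngrams for g in summary_ngrams)
    return novel / len(summary_ngrams)

source_text = "the cat sat on the mat"
summary_text = "the cat rested on the mat"
unigram_rate = novel_ngram_rate(source_text, summary_text, 1)
bigram_rate = novel_ngram_rate(source_text, summary_text, 2)
print(unigram_rate, bigram_rate)
```

A fully copied summary scores 0 at every n, so higher values indicate more abstractive output.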

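This study proposes an approach based on a multihead attention mechanism, pointer dropout, and new loss functions to promote more abstractive summaries while maintaining similar ROUGE scores, achieving a higher rate of novel N-grams and sentences.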