We investigate the extent to which individual attention heads in pretrained
transformer language models, such as BERT and RoBERTa, implicitly capture
syntactic dependency relations. We employ two methods (taking the maximum
attention weight and computing the maximum spanning tree) to extract implicit
dependency relations from the attention weights of each layer/head, and compare
them to the ground-truth Universal Dependency (UD) trees. We show that, for
some UD relation types, there exist heads that can recover the dependency type
significantly better than baselines on parsed English text, suggesting that
some self-attention heads act as a proxy for syntactic structure. We also
analyze BERT fine-tuned on two datasets (the syntax-oriented CoLA and the
semantics-oriented MNLI) to investigate whether fine-tuning affects its
self-attention patterns, but we do not observe substantial differences
in the overall dependency relations extracted using our methods. Our results
suggest that these models have some specialist attention heads that track
individual dependency types, but no generalist head that performs holistic
parsing significantly better than a trivial baseline, and that analyzing
attention weights directly may not reveal much of the syntactic knowledge that
BERT-style models are known to learn.
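
The two extraction methods named in the abstract can be illustrated on a toy attention matrix. The following is a minimal sketch, not the authors' implementation: the function names, the 3-token matrix, and the gold heads are hypothetical, and the tree step uses Prim's algorithm on symmetrized weights as a simple stand-in for whichever maximum spanning tree algorithm the paper actually uses.

```python
# Toy sketch of extracting dependency heads from one attention head.
# All names and numbers here are illustrative assumptions.

def argmax_heads(attn):
    """Method 1: for each token i, pick the other token it attends to most."""
    preds = []
    for i, row in enumerate(attn):
        candidates = [j for j in range(len(row)) if j != i]
        preds.append(max(candidates, key=lambda j: row[j]))
    return preds


def spanning_tree_heads(attn, root):
    """Method 2 (simplified): maximum spanning tree grown from a given root.

    Weights are symmetrized with max(), then Prim's algorithm attaches one
    token at a time. Returns a head index per token, with -1 for the root.
    """
    n = len(attn)
    w = [[max(attn[i][j], attn[j][i]) for j in range(n)] for i in range(n)]
    head = [None] * n
    head[root] = -1
    in_tree = {root}
    while len(in_tree) < n:
        # Heaviest edge leading from the tree to a token outside it.
        _, i, j = max(
            (w[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        head[j] = i
        in_tree.add(j)
    return head


def head_accuracy(pred_heads, gold_heads):
    """Fraction of non-root tokens whose predicted head matches the gold head."""
    pairs = [(p, g) for p, g in zip(pred_heads, gold_heads) if g != -1]
    return sum(p == g for p, g in pairs) / len(pairs)


# "the dog barks": the -> dog, dog -> barks, barks is the root.
gold = [1, 2, -1]
attn = [
    [0.1, 0.8, 0.1],
    [0.3, 0.2, 0.5],
    [0.2, 0.6, 0.2],
]

max_preds = argmax_heads(attn)
tree_preds = spanning_tree_heads(attn, root=2)
print(max_preds, head_accuracy(max_preds, gold))
print(tree_preds, head_accuracy(tree_preds, gold))
```

Note that the maximum-weight method assigns a head to every token, including the root, so its output is not guaranteed to be a tree; the spanning-tree method always yields one.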

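This study investigates the extent to which attention heads in pretrained transformer language models implicitly capture syntactic dependency relations. Two methods are used to extract implicit dependency relations from the attention weights of each layer/head, and the results are compared against the ground-truth UD trees. The results show that these models have some specialist attention heads that track specific dependency types, but no head that performs holistic parsing better than a trivial baseline. Analyzing attention weights directly may therefore not reveal much of the syntactic knowledge that BERT-style models are known to learn.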

Do Attention Heads in BERT Track Syntactic Dependencies?

We present BlockBERT, a lightweight and efficient BERT model for better
modeling long-distance dependencies. Our model extends BERT by introducing
sparse block structures into the attention matrix to reduce both memory
consumption and training/inference time, which also enables attention heads to
capture either short- or long-range contextual information. We conduct
experiments on language model pre-training and several benchmark question
answering datasets with various paragraph lengths. BlockBERT uses 18.7-36.1%
less memory and 12.0-25.1% less time to train the model. During testing,
BlockBERT saves 27.8% inference time, while having comparable and sometimes
better prediction accuracy, compared to an advanced BERT-based model, RoBERTa.
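
To make the idea of a sparse block structure in the attention matrix concrete, here is a minimal sketch. It assumes the sequence is split into equal-sized blocks and that each head is given a permutation saying which key block each query block may attend to; the function name `block_mask` and this permutation scheme are assumptions for illustration, not details stated in the abstract.

```python
# Illustrative block-sparse attention mask (hypothetical helper).

def block_mask(seq_len, num_blocks, perm):
    """Return a seq_len x seq_len 0/1 mask.

    Query position i may attend to key position j only if the block
    containing j equals perm[block containing i].
    """
    if seq_len % num_blocks != 0:
        raise ValueError("seq_len must be divisible by num_blocks")
    if sorted(perm) != list(range(num_blocks)):
        raise ValueError("perm must be a permutation of the block indices")
    size = seq_len // num_blocks
    return [
        [1 if perm[i // size] == j // size else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


def density(mask):
    """Fraction of attention entries that are kept."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Identity permutation: each block attends to itself (short-range context).
local = block_mask(4, 2, [0, 1])
# Swapped permutation: each block attends to the other block (long-range context).
distant = block_mask(4, 2, [1, 0])

for row in local:
    print(row)
print(density(local))
```

With `num_blocks` blocks, only a `1 / num_blocks` fraction of the attention matrix is kept under this scheme, which is where a memory saving of this kind would come from. The specific percentages reported in the abstract depend on the full model and are not reproduced by this toy mask.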

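BlockBERT is an efficient BERT model that introduces sparse block structures into the attention matrix to better model long-distance dependencies. Experiments cover language model pre-training and benchmark question answering datasets. Compared with RoBERTa, BlockBERT saves 27.8% of inference time while achieving comparable and sometimes better prediction accuracy.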

Blockwise Self-Attention for Long Document Understanding

Pre-training a Transformer on large-scale raw text and fine-tuning it on the
desired task have achieved state-of-the-art results on diverse NLP tasks.
However, it is unclear what the learned attention captures. The attention
computed by attention heads does not seem to match human intuitions about
hierarchical structures. This paper proposes Tree Transformer, which adds an
extra constraint to attention heads of the bidirectional Transformer encoder in
order to encourage the attention heads to follow tree structures. The tree
structures can be automatically induced from raw texts by our proposed
"Constituent Attention" module, which is simply implemented by self-attention
between two adjacent words. With a training procedure identical to BERT's,
experiments demonstrate the effectiveness of Tree Transformer at inducing
tree structures, improving language modeling, and learning more explainable
attention scores.
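
One way to picture how scores between adjacent words could constrain attention is sketched below. This is an assumption-laden illustration, not the paper's stated formulation: it supposes each adjacent word pair has a link probability, that the prior for two words being in the same constituent is the product of the link probabilities between them, and that attention is rescaled elementwise by that prior. The names `constituent_prior` and `constrain` are hypothetical.

```python
# Hypothetical sketch of a tree-style constraint on attention weights.

def constituent_prior(link_probs):
    """Build an n x n prior from n-1 adjacent-word link probabilities.

    prior[i][j] is the product of link_probs[k] for all adjacent pairs
    (k, k+1) lying between words i and j; the diagonal is 1.0.
    """
    n = len(link_probs) + 1
    prior = [[1.0] * n for _ in range(n)]
    for i in range(n):
        prod = 1.0
        for j in range(i + 1, n):
            prod *= link_probs[j - 1]
            prior[i][j] = prod
            prior[j][i] = prod
    return prior


def constrain(attn, prior):
    """Rescale attention elementwise by the constituent prior."""
    return [[a * p for a, p in zip(a_row, p_row)]
            for a_row, p_row in zip(attn, prior)]


# Three words: words 0 and 1 are strongly linked, words 1 and 2 weakly.
prior = constituent_prior([1.0, 0.25])
uniform = [[0.5, 0.5, 0.5] for _ in range(3)]
for row in constrain(uniform, prior):
    print(row)
```

Under these assumptions, attention that crosses a weak link is damped, so a head is nudged toward attending within the same constituent.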

This paper proposes Tree Transformer, a Transformer variant that uses self-attention to induce tree structures, yielding better language modeling and more explainable attention scores.

Tree Transformer: Integrating Tree Structures into Self-Attention

Learning algorithms are becoming more powerful, often at the cost of increased
complexity. In response, the demand for algorithms to be transparent is
growing. In NLP tasks, attention distributions learned by attention-based deep
learning models are used to gain insight into the models' behavior. To what
extent is this perspective valid for all NLP tasks? We investigate whether
distributions calculated by different attention heads in a transformer
architecture can be used to improve transparency in the task of abstractive
summarization. To this end, we present both a qualitative and quantitative
analysis to investigate the behavior of the attention heads. We show that some
attention heads indeed specialize towards syntactically and semantically
distinct input. We propose an approach to evaluate to what extent the
Transformer model relies on specifically learned attention distributions. We
also discuss what this implies for using attention distributions as a means of
transparency.

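This paper analyzes the attention distributions computed by different heads in a transformer and proposes an approach to evaluate how much the Transformer model relies on specific learned attention distributions. It finds that some attention heads do specialize toward syntactically and semantically distinct input, and it discusses what this implies for using attention distributions as a means of transparency, including whether that perspective is valid for all NLP tasks.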