Learned self-attention functions in state-of-the-art NLP models often
correlate with human attention. We investigate whether self-attention in
large-scale pre-trained language models is as predictive of human eye fixation
patterns during task-reading as classical cognitive models of human attention.
We compare attention functions across two task-specific reading datasets for
sentiment analysis and relation extraction. We find the predictiveness of
large-scale pre-trained self-attention for human attention depends on `what is
in the tail', e.g., the syntactic nature of rare contexts. Further, we observe
that task-specific fine-tuning does not increase the correlation with human
task-specific reading. Through an input reduction experiment we give
complementary insights on the sparsity and fidelity trade-off, showing that
lower-entropy attention vectors are more faithful.
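
As a rough illustration of the two quantities this abstract relies on, the sketch below computes the entropy of an attention vector and a rank correlation between model attention and human fixation. It is a minimal sketch, not the authors' code: the choice of Spearman correlation, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math


def entropy(weights):
    """Shannon entropy (in nats) of a normalized attention vector."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation (assumed metric, not confirmed by the abstract)."""
    return pearson(ranks(x), ranks(y))


# Made-up per-token values for a five-token sentence (illustrative only).
model_attention = [0.05, 0.60, 0.10, 0.20, 0.05]
human_fixation = [0.10, 0.50, 0.10, 0.25, 0.05]

print("attention entropy:", entropy(model_attention))
print("rank correlation:", spearman(model_attention, human_fixation))
```

A sharply peaked vector yields a lower entropy than a flat one, which is the sense of "lower-entropy attention vectors" used above.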

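By comparing two task-specific reading datasets, the study shows that how well large-scale pre-trained self-attention predicts human attention depends on the syntactic nature of rare contexts, that task-specific fine-tuning does not increase the correlation with human reading, and that an input reduction experiment offers complementary insights, indicating that lower-entropy attention vectors are more faithful.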

Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze?

Gradient-based analysis methods, such as saliency map visualizations and
adversarial input perturbations, have found widespread use in interpreting
neural NLP models due to their simplicity, flexibility, and most importantly,
their faithfulness. In this paper, however, we demonstrate that the gradients
of a model are easily manipulable, and thus bring into question the reliability
of gradient-based analyses. In particular, we merge the layers of a target
model with a Facade that overwhelms the gradients without affecting the
predictions. This Facade can be trained to have gradients that are misleading
and irrelevant to the task, such as focusing only on the stop words in the
input. On a variety of NLP tasks (text classification, NLI, and QA), we show
that our method can manipulate numerous gradient-based analysis techniques:
saliency maps, input reduction, and adversarial perturbations all identify
unimportant or targeted tokens as being highly important. The code and a
tutorial for this paper are available at this http URL
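
To make concrete why gradients can be misleading while predictions stay put, here is a toy sketch. It is not the trained Facade described in the abstract: it swaps in a hand-built additive term (`facade`, a hypothetical name) whose output is bounded by a small constant but whose slope is large, simply to show that a model's output and its input gradient can be decoupled. The two-feature linear `base_model` and all constants are assumptions.

```python
import math


def base_model(x):
    """Toy 'real' model: the score depends almost entirely on feature 0."""
    return 2.0 * x[0] + 0.01 * x[1]


def facade(x, eps=1e-3, scale=50.0):
    """Hand-built term: |output| <= eps * scale, but slope in x[1] up to `scale`."""
    return eps * scale * math.sin(x[1] / eps)


def merged_model(x):
    """Base model plus the facade term; the score shifts by at most 0.05."""
    return base_model(x) + facade(x)


def numerical_gradient(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


x = [1.0, 0.0]
print("score shift:", abs(merged_model(x) - base_model(x)))
print("base gradient:", numerical_gradient(base_model, x))
print("merged gradient:", numerical_gradient(merged_model, x))
```

At this input, a gradient-based saliency ranking would point to feature 0 for the base model and to feature 1 for the merged one, even though the two scores are nearly identical. The toy only behaves this way where the cosine of `x[1] / eps` is far from zero; a trained Facade has no such restriction by construction.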

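This paper studies the interpretability of neural NLP models, in particular gradient-based analysis methods. We find that the gradients these methods rely on are easily hijacked and can be misleading. Supported by experiments on several NLP tasks, the paper proposes a method based on an overlay layer (the Facade) to interfere with and deceive these gradients.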

Gradient-based Analysis of NLP Models is Manipulable

One way to interpret neural model predictions is to highlight the most
important input features---for example, a heatmap visualization over the words
in an input sentence. In existing interpretation methods for NLP, a word's
importance is determined by either input perturbation---measuring the decrease
in model confidence when that word is removed---or by the gradient with respect
to that word. To understand the limitations of these methods, we use input
reduction, which iteratively removes the least important word from the input.
This exposes pathological behaviors of neural models: the remaining words
appear nonsensical to humans and are not the ones determined as important by
interpretation methods. As we confirm with human experiments, the reduced
examples lack information to support the prediction of any label, but models
still make the same predictions with high confidence. To explain these
counterintuitive results, we draw connections to adversarial examples and
confidence calibration: pathological behaviors reveal difficulties in
interpreting neural models trained with maximum likelihood. To mitigate their
deficiencies, we fine-tune the models by encouraging high entropy outputs on
reduced examples. Fine-tuned models become more interpretable under input
reduction without accuracy loss on regular examples.
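
The input reduction procedure described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: importance is measured by leave-one-out confidence drop, reduction stops as soon as removing the least important word would change the predicted label, and `toy_model` is a hypothetical keyword-weight classifier standing in for a neural model.

```python
import math


def toy_model(tokens):
    """Hypothetical stand-in classifier returning label probabilities."""
    weights = {"great": 2.0, "good": 1.0, "bad": -1.0, "terrible": -2.0}
    score = sum(weights.get(t, 0.0) for t in tokens)
    prob_pos = 1.0 / (1.0 + math.exp(-score))
    return {"pos": prob_pos, "neg": 1.0 - prob_pos}


def predict(model, tokens):
    """Return the most probable label for `tokens`."""
    probs = model(tokens)
    return max(probs, key=probs.get)


def input_reduction(model, tokens):
    """Iteratively drop the least important token while the label is unchanged."""
    tokens = list(tokens)
    original_label = predict(model, tokens)
    while len(tokens) > 1:
        base_conf = model(tokens)[original_label]
        # Importance of token i = confidence lost when token i is removed.
        drops = [
            base_conf - model(tokens[:i] + tokens[i + 1:])[original_label]
            for i in range(len(tokens))
        ]
        least = min(range(len(tokens)), key=drops.__getitem__)
        candidate = tokens[:least] + tokens[least + 1:]
        if predict(model, candidate) != original_label:
            break  # assumed stopping rule: never change the prediction
        tokens = candidate
    return tokens


print(input_reduction(toy_model, ["the", "movie", "was", "great"]))
```

The toy classifier is too simple to reproduce the pathological behavior the abstract reports; the sketch only shows the mechanics of the reduction loop.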

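Using input reduction, the paper studies the deficiencies of neural models and finds that most models struggle and are hard to interpret when faced with anomalous inputs. It proposes a fine-tuning method that improves interpretability by raising the entropy of the model's outputs.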