The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that attention weights alone are not enough to describe the flow of information. In this paper, we consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions within each layer, considering the characteristics of the representation space. Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions. Experimentally, we show that our method, ALTI (Aggregation of Layer-wise Token-to-token Interactions), provides faithful explanations and outperforms similar aggregation methods.
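
The abstract does not spell out how the layer-wise interpretations are combined, so the following is only a minimal sketch under a stated assumption: each layer yields a non-negative token-to-token contribution matrix, and the matrices are chained by row-normalizing and multiplying them, in the style of rollout aggregation. The function names (`normalize_rows`, `matmul`, `aggregate_layers`) and the example matrices are hypothetical, and the per-layer metric itself is not reproduced here.

```python
from typing import List

Matrix = List[List[float]]


def normalize_rows(m: Matrix) -> Matrix:
    """Scale each row to sum to 1; rows that sum to 0 stay all-zero."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def aggregate_layers(layer_contribs: List[Matrix]) -> Matrix:
    """Chain per-layer contribution matrices into input attributions.

    layer_contribs[l][i][j] is assumed to hold the contribution of
    token j at the input of layer l to token i at its output.
    The result[i][j] attributes final-layer token i to model input token j.
    """
    total = normalize_rows(layer_contribs[0])
    for contrib in layer_contribs[1:]:
        # Later layers are applied on the left: output <- layer <- earlier layers.
        total = matmul(normalize_rows(contrib), total)
    return total


# Two hypothetical layers over a two-token input.
layer_1 = [[3.0, 1.0], [1.0, 1.0]]
layer_2 = [[1.0, 1.0], [0.0, 2.0]]
scores = aggregate_layers([layer_1, layer_2])
print(scores)
```

Because every per-layer matrix is row-normalized before multiplication, each row of the result also sums to 1, so the scores for a given output token can be read as a distribution over input tokens.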

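Summary: The paper proposes ALTI, a method that considers the whole attention block (multi-head attention, residual connection, and layer normalization) and defines a new metric for measuring token-to-token interactions within each layer. Aggregating these layer-wise interactions yields input attribution scores that explain model predictions. In the experiments, ALTI provides more faithful explanations than similar aggregation methods.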

Measuring the Mixing of Contextual Information in the Transformer