Perturbation-based explanation methods such as LIME and SHAP are commonly
applied to text classification. This work focuses on their extension to
generative language models. To address the challenges of text as output and
long text inputs, we propose a general framework called MExGen that can be
instantiated with different attribution algorithms. To handle text output, we
introduce the notion of scalarizers for mapping text to real numbers and
investigate multiple possibilities. To handle long inputs, we take a
multi-level approach, proceeding from coarser levels of granularity to finer
ones, and focus on algorithms with linear scaling in model queries. We conduct
a systematic evaluation, both automated and human, of perturbation-based
attribution methods for summarization and context-grounded question answering.
The results show that our framework can provide more locally faithful
explanations of generated outputs.
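
The two ideas above, a scalarizer and coarse-to-fine attribution, can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than MExGen's actual implementation: `overlap_scalarizer` is a toy word-overlap score, `toy_model` replaces a real generative model, and leave-one-out is just one attribution algorithm whose number of model queries grows linearly with the number of input units.

```python
import re
from typing import Callable, Dict, List, Tuple


def overlap_scalarizer(reference: str, candidate: str) -> float:
    """Toy scalarizer (assumption): fraction of reference words found in candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    return sum(w in cand_words for w in ref_words) / len(ref_words)


def leave_one_out(units: List[str], join: Callable[[List[str]], str],
                  model: Callable[[str], str],
                  scalarizer: Callable[[str, str], float]) -> List[float]:
    """Score each unit by the drop in scalarized output when it is removed.

    Uses len(units) + 1 model queries, so cost is linear in the number of units.
    """
    original = model(join(units))
    baseline = scalarizer(original, original)
    scores = []
    for i in range(len(units)):
        perturbed_output = model(join(units[:i] + units[i + 1:]))
        scores.append(baseline - scalarizer(original, perturbed_output))
    return scores


def multi_level(text: str, model: Callable[[str], str],
                scalarizer: Callable[[str, str], float],
                top_k: int = 1) -> Tuple[List[float], Dict[int, List[Tuple[str, float]]]]:
    """Attribute to sentences first, then refine only the top_k sentences into words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sent_scores = leave_one_out(sentences, " ".join, model, scalarizer)
    top = sorted(range(len(sentences)), key=lambda i: -sent_scores[i])[:top_k]
    word_scores: Dict[int, List[Tuple[str, float]]] = {}
    for i in top:
        words = sentences[i].split()

        def join_words(ws: List[str], i: int = i) -> str:
            # Rebuild the full input with only sentence i perturbed.
            return " ".join(sentences[:i] + [" ".join(ws)] + sentences[i + 1:])

        scores = leave_one_out(words, join_words, model, scalarizer)
        word_scores[i] = list(zip(words, scores))
    return sent_scores, word_scores


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a generative model: echoes capitalized words."""
    return " ".join(w for w in text.split() if w[0].isupper())


sentence_scores, refined = multi_level(
    "The cat sat. Paris is in France. It rained.", toy_model, overlap_scalarizer
)
```

In this toy run the second sentence receives the highest sentence-level score, and only that sentence is refined to word level, which is what keeps the query count small on long inputs.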

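We propose a general framework called MExGen that extends perturbation-based explanation methods from text classification (such as LIME and SHAP) to meet the challenges of generative language models. The framework can be instantiated with different attribution algorithms. It handles text output through scalarizers that map text to real numbers, and handles long inputs through a multi-level approach that proceeds from coarse to fine granularity and focuses on algorithms that scale linearly in model queries. A systematic evaluation shows that the framework provides more locally faithful explanations of generated outputs.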

Multi-Level Explanations for Generative Language Models

Effectively training language models on long inputs poses many technical
challenges. As a cost consideration, language models are pretrained on a fixed
sequence length before being adapted to longer sequences. We explore various
methods for adapting models to longer inputs by training on segmented sequences
and an interpolation-based method for extending absolute positional embeddings.
We develop a training procedure to extend the input context size of pretrained
models with no architectural changes and no memory cost beyond that of
training on the original input lengths. By sub-sampling segments from long
inputs while maintaining their original positions, the model is able to learn
new positional interactions. Our method benefits models trained with absolute
positional embeddings, by extending their input contexts, as well as models
using popular relative positional embedding methods, which show reduced
perplexity on sequences longer than those they were trained on. We demonstrate
that our method can extend input contexts by a factor of 4 while improving
perplexity.
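
The segment sub-sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the function name `sample_segments`, the grid-aligned choice of segments, and the uniform sampling are all hypothetical. The point it shows is that the sampled batch has the length of a short input while its position ids span the long input.

```python
import random
from typing import List, Sequence, Tuple


def sample_segments(tokens: Sequence[int], segment_len: int, num_segments: int,
                    rng: random.Random) -> Tuple[List[int], List[int]]:
    """Sub-sample fixed-length segments and keep their original position ids.

    Assumption: segments are aligned to a grid of width segment_len and
    chosen uniformly without replacement, so they never overlap.
    """
    n_slots = len(tokens) // segment_len
    if num_segments > n_slots:
        raise ValueError("not enough tokens for the requested segments")
    chosen = sorted(rng.sample(range(n_slots), num_segments))
    sub_tokens: List[int] = []
    position_ids: List[int] = []
    for slot in chosen:
        start = slot * segment_len
        sub_tokens.extend(tokens[start:start + segment_len])
        # Positions index into the long input, not into the shortened sequence.
        position_ids.extend(range(start, start + segment_len))
    return sub_tokens, position_ids


long_input = list(range(100, 164))  # 64 placeholder token ids
sub_tokens, position_ids = sample_segments(long_input, 4, 4, random.Random(0))
```

Here the model would see 16 tokens per example, the memory footprint of a 16-token input, while the position ids it is trained on range over all 64 positions.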

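We develop a training procedure that extends the input context size of pretrained models, with no architectural changes and no additional memory cost, by training on segmented sequences and using an interpolation-based method to extend absolute positional embeddings. Our method extends input contexts by a factor of 4 while improving perplexity.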

Extending Input Contexts of Language Models through Training on Segmented Sequences

While the self-attention mechanism has been used in a wide variety of tasks,
its cost is quadratic in the input length, which makes long inputs difficult
to handle. In this paper, we present a method for accelerating and structuring
self-attention: Sparse Adaptive Connection (SAC). In SAC, we regard the input
sequence as a graph, and attention operations are performed between linked
nodes. In contrast with previous self-attention models with pre-defined
structures (edges), the model learns to construct attention edges to improve
task-specific performance. In this way, the model is able to select the most
salient nodes and reduce the quadratic complexity regardless of the sequence
length. Based on SAC, we show that previous variants of self-attention models
are its special cases. Through
extensive experiments on neural machine translation, language modeling, graph
representation learning and image classification, we demonstrate SAC is
competitive with state-of-the-art models while significantly reducing memory
cost.
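
To make the graph view concrete, the sketch below computes attention only along a given edge list, so the work is proportional to the number of edges rather than to the square of the sequence length. It is an illustration under assumptions: the edges are supplied by hand, whereas SAC learns to construct them, and the function name `edge_attention` and the zero output for nodes without edges are choices made for this example.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def edge_attention(queries: Sequence[Vector], keys: Sequence[Vector],
                   values: Sequence[Vector],
                   edges: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Scaled dot-product attention restricted to linked nodes.

    An edge (i, j) means node i attends to node j. Nodes with no outgoing
    edge get a zero vector (an assumption made for this sketch).
    """
    neighbors = defaultdict(list)
    for i, j in edges:
        neighbors[i].append(j)
    scale = math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    outputs: List[List[float]] = []
    for i, q in enumerate(queries):
        linked = neighbors.get(i, [])
        if not linked:
            outputs.append([0.0] * value_dim)
            continue
        logits = [sum(a * b for a, b in zip(q, keys[j])) / scale for j in linked]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][c] for w, j in zip(weights, linked))
            for c in range(value_dim)
        ])
    return outputs


queries = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
attended = edge_attention(queries, keys, values, edges=[(0, 1), (1, 0), (1, 2)])
```

With a fully connected edge list this reduces to ordinary dense self-attention, which is one way to read the claim that earlier variants are special cases.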

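This paper introduces Sparse Adaptive Connection (SAC), a method that treats the input sequence as a graph and constructs attention edges to improve task-specific performance, thereby optimizing the self-attention mechanism. Extensive experiments on graph representation learning and image classification show that SAC is competitive with state-of-the-art models while reducing memory cost.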

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Recurrent sequence generators conditioned on input data through an attention
mechanism have recently shown very good performance on a range of tasks
including machine translation, handwriting synthesis and image caption
generation. We extend the attention mechanism with features needed for speech
recognition. We show that while an adaptation of the model used for machine
translation reaches a competitive 18.7% phoneme error rate (PER) on the
TIMIT phoneme recognition task, it can only be applied to utterances which are
roughly as long as the ones it was trained on. We offer a qualitative
explanation of this failure and propose a novel and generic method of adding
location-awareness to the attention mechanism to alleviate this issue. The new
method yields a model that is robust to long inputs and achieves 18% PER on
single utterances and 20% on 10-times longer (repeated) utterances. Finally, we
propose a change to the attention mechanism that prevents it from
concentrating too much on single frames, which further reduces PER to 17.6%.
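
One way to picture location-awareness is to add to each frame's content score a term computed by convolving the previous step's alignment, so the attention can tell apart frames with identical content by where it last looked. The sketch below is a simplified scalar illustration, not the paper's model: the names, the single hand-set kernel, and the additive combination are assumptions, and the `smooth` option (sigmoid instead of exponential before normalizing) is one assumed way to keep the alignment from concentrating on a single frame.

```python
import math
from typing import List, Sequence


def convolve_same(alignment: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D cross-correlation with zero padding; output has the input's length."""
    half = len(kernel) // 2
    out = []
    for j in range(len(alignment)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = j + k - half
            if 0 <= idx < len(alignment):
                acc += w * alignment[idx]
        out.append(acc)
    return out


def location_aware_alignment(content_scores: Sequence[float],
                             prev_alignment: Sequence[float],
                             kernel: Sequence[float],
                             smooth: bool = False) -> List[float]:
    """Combine content scores with location features from the previous alignment."""
    location = convolve_same(prev_alignment, kernel)
    scores = [c + f for c, f in zip(content_scores, location)]
    if smooth:
        # Bounded sigmoid instead of exp, so no single frame can dominate.
        unnormalized = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    else:
        peak = max(scores)
        unnormalized = [math.exp(s - peak) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Four frames with identical content, as in a repeated utterance.
content = [0.0, 0.0, 0.0, 0.0]
previous = [0.0, 1.0, 0.0, 0.0]   # last step attended to frame 1
shift_forward = [3.0, 0.0, 0.0]   # hand-set kernel favoring the next frame
sharp = location_aware_alignment(content, previous, shift_forward)
smoothed = location_aware_alignment(content, previous, shift_forward, smooth=True)
```

Content scores alone cannot separate the four frames here; the location term moves the peak to the frame after the previous one, and the smoothed variant spreads more weight over the neighbors.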

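This study proposes a model based on an improved attention mechanism with added location-awareness, which addresses the problems of recognizing long input utterances and effectively reduces the phoneme error rate.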