Parameter-efficient tuning aims to mitigate the large memory requirements of
adapting pretrained language models for downstream tasks. For example, one
popular method, prefix-tuning, prepends trainable tokens to sequences while
freezing the rest of the model's parameters. Although such models attain
performance comparable to fine-tuning on sequences of short to moderate
length, we show that they underperform when modelling long
sequences. To bridge this gap, we propose prefix-propagation, a simple but
effective approach that conditions prefixes on previous hidden states. We
empirically demonstrate that prefix-propagation outperforms prefix-tuning
across long-document tasks, while using 50% fewer parameters. To further
investigate the proposed architecture, we also show its advantage in
calibration and perform an additional study of its relationship with kernel
attention. To the best of our knowledge, this work is the first to focus on
parameter-efficient learning for long-sequence language tasks.
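
The contrast between the two methods can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the toy single-head residual attention layer, the function names, and the shapes are all hypothetical. It assumes that prefix-tuning prepends separate trainable key and value vectors at every layer, and that prefix-propagation prepends prefix positions to the sequence and adds one trainable vector per position at every layer, so that each prefix is built on the previous layer's hidden state. Under those assumptions the halved parameter count follows directly.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # Plain scaled dot-product attention for a single head.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def prefix_tuning(x, frozen, prefixes):
    # Assumption: each layer owns trainable prefix keys AND prefix values,
    # which are prepended to the keys/values computed from the sequence.
    h = x
    for (wq, wk, wv), (pk, pv) in zip(frozen, prefixes):
        k = np.concatenate([pk, h @ wk])
        v = np.concatenate([pv, h @ wv])
        h = h + attend(h @ wq, k, v)
    return h

def prefix_propagation(x, frozen, prefixes):
    # Assumption: prefix positions live in the sequence itself. At each layer
    # a trainable vector is ADDED to the hidden state those positions carried
    # over from the previous layer, so prefixes depend on earlier layers.
    n = prefixes[0].shape[0]
    h = np.concatenate([np.zeros((n, x.shape[1])), x])
    for (wq, wk, wv), p in zip(frozen, prefixes):
        h = np.concatenate([h[:n] + p, h[n:]])
        h = h + attend(h @ wq, h @ wk, h @ wv)
    return h[n:]  # drop the prefix positions from the output

rng = np.random.default_rng(0)
d, seq_len, n_prefix, n_layers = 8, 16, 4, 2

# Frozen (non-trainable) projection weights shared by both variants.
frozen = [tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))
          for _ in range(n_layers)]
x = rng.normal(size=(seq_len, d))

pt_prefixes = [(rng.normal(size=(n_prefix, d)), rng.normal(size=(n_prefix, d)))
               for _ in range(n_layers)]
pp_prefixes = [rng.normal(size=(n_prefix, d)) for _ in range(n_layers)]

out_pt = prefix_tuning(x, frozen, pt_prefixes)
out_pp = prefix_propagation(x, frozen, pp_prefixes)

# Trainable parameter counts under the assumptions above.
pt_params = sum(pk.size + pv.size for pk, pv in pt_prefixes)
pp_params = sum(p.size for p in pp_prefixes)
```

In this toy setup the propagated variant trains one vector per prefix position per layer instead of a key/value pair, which is one plausible reading of the "50% fewer parameters" claim.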

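This paper explores parameter-efficient learning for long-sequence language tasks. It proposes prefix-propagation, a simple but effective method that shows an advantage in calibration, is studied in relation to kernel attention, and uses 50% fewer parameters than prefix-tuning.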

Prefix Propagation: Parameter-Efficient Tuning for Long Sequences

Transformers have improved the state-of-the-art across numerous tasks in
sequence modeling. Besides its quadratic computational and memory complexity
with respect to sequence length, the self-attention mechanism processes
information at only a single scale, i.e., all attention heads operate at the
same resolution, which limits the power of the Transformer. To remedy this,
we propose a novel and efficient structure named Adaptive Multi-Resolution
Attention (AdaMRA for short), which scales linearly with sequence length in terms
of time and space. Specifically, we leverage a multi-resolution multi-head
attention mechanism, enabling attention heads to capture long-range contextual
information in a coarse-to-fine fashion. Moreover, to capture the potential
relations between query representation and clues of different attention
granularities, we leave the decision of which attention resolution to use to
the query, which further improves the model's capacity compared to the vanilla
Transformer. To reduce complexity, we adopt kernel attention
without degrading the performance. Extensive experiments on several benchmarks
demonstrate the effectiveness and efficiency of our model by achieving a
state-of-the-art performance-efficiency-memory trade-off. To facilitate AdaMRA
utilization by the scientific community, the code implementation will be made
publicly available.
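
The linear scaling attributed to kernel attention can be illustrated with a short NumPy sketch. This is a generic kernelized attention under stated assumptions, not AdaMRA itself: the feature map `elu(x) + 1` and all function names are choices made here for illustration, and the multi-resolution routing described above is not modelled. The idea is that replacing the softmax with a kernel `phi(q) · phi(k)` lets the products be reordered, so the n-by-n score matrix is never formed.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map (an assumed choice, not from the source).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention_quadratic(q, k, v):
    # Reference version: builds the full (n, n) score matrix.
    scores = feature_map(q) @ feature_map(k).T
    return (scores @ v) / scores.sum(axis=-1, keepdims=True)

def kernel_attention_linear(q, k, v):
    # Reordered version: summarise keys/values once, then apply per query.
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v          # (d, d_v) summary, cost linear in sequence length
    z = fk.sum(axis=0)     # (d,) normaliser
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 64, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

out_quadratic = kernel_attention_quadratic(q, k, v)
out_linear = kernel_attention_linear(q, k, v)
```

Both functions compute the same quantity; only the order of the matrix products differs, so time and memory in the second version grow linearly with the sequence length `n` for a fixed head dimension `d`.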

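This paper introduces Adaptive Multi-Resolution Attention (AdaMRA), a novel and efficient Transformer structure. It uses a multi-resolution multi-head self-attention mechanism with kernel attention, scales linearly in both time and space, and further improves the model's capacity. It achieves a state-of-the-art performance-efficiency trade-off on several benchmarks.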