Non-hierarchical sparse attention Transformer-based models, such as
Longformer and Big Bird, are popular approaches to working with long documents.
There are clear benefits to these approaches compared to the original
Transformer in terms of efficiency, but Hierarchical Attention Transformer
(HAT) models are a vastly understudied alternative. We develop and release
fully pre-trained HAT models that use segment-wise followed by cross-segment
encoders and compare them with Longformer models and partially pre-trained
HATs. In several long document downstream classification tasks, our best HAT
model outperforms equally-sized Longformer models while using 10-20% less GPU
memory and processing documents 40-45% faster. In a series of ablation studies,
we find that HATs perform best with cross-segment contextualization throughout
the model than alternative configurations that implement either early or late
cross-segment contextualization. Our code is on GitHub:
this https URL.

本研究开发并发布了使用分段编码器，并将其与 Longformer 模型和部分预训练的 HAT 进行比较的完全预训练 HAT 模型，在多个长文档下游分类任务中，我们的最佳 HAT 模型在使用 10-20％ GPU 内存的情况下比同等大小的 Longformer 模型更快地处理文档并实现更好的性能。在消融研究中，发现 HAT 在整个模型中进行跨段上下文信息处理比其他配置的早期或晚期跨段上下文处理性能更好。

基于分层注意力机制的高效长文档分类探索

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification

One of the challenges for current sequence to sequence (seq2seq) models is
processing long sequences, such as those in summarization and document level
machine translation tasks. These tasks require the model to reason at the token
level as well as the sentence and paragraph level. We design and study a new
Hierarchical Attention Transformer-based architecture (HAT) that outperforms
standard Transformers on several sequence to sequence tasks. Furthermore, our
model achieves state-of-the-art ROUGE scores on four summarization tasks,
including PubMed, arXiv, CNN/DM, SAMSum, and AMI. Our model outperforms
document-level machine translation baseline on the WMT20 English to German
translation task. We investigate what the hierarchical layers learn by
visualizing the hierarchical encoder-decoder attention. Finally, we study
hierarchical learning on encoder-only pre-training and analyze its performance
on classification tasks.

本研究设计并研究了一种新的分层注意力 Transformer 架构（HAT），在几个序列到序列任务中优于标准 Transformer，包括在 PubMed、arXiv、CNN/DM、SAMSum 和 AMI 上的四个摘要任务中取得了最新的 ROUGE 分数。该架构在 WMT20 英文到德文翻译任务中优于文档级机器翻译基线，并通过可视化分层编解码器注意力来研究了分层层次的理解，最后研究了编码器预训练上的分层学习并分析了其在分类任务上的性能。