Since their release, Transformers have revolutionized many fields from
Natural Language Understanding to Computer Vision. Document Understanding (DU)
was not left behind, with the first Transformer-based models for DU dating from
late 2019. However, the computational complexity of the self-attention
operation limits these models to short sequences. In this paper we explore
multiple strategies to apply Transformer-based models to long multi-page
documents. We introduce two new multi-modal (text + layout) long-range models
for DU. They are based on efficient implementations of Transformers for long
sequences.
Long-range models can process whole documents at once effectively and are less
impaired by the document's length. We compare them to LayoutLM, a classical
Transformer adapted for DU and pre-trained on millions of documents. We further
propose 2D relative attention bias to guide self-attention towards relevant
tokens without harming model efficiency. We observe improvements on Information
Retrieval for multi-page business documents, at a small performance cost on
shorter sequences. Relative 2D attention proved effective on dense text
for both normal and long-range models.
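
To make the idea of a 2D relative attention bias concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the clipped-offset lookup tables `bias_x` and `bias_y`, the helper `relative_bucket`, and the toy coordinates are all assumptions chosen for illustration. It only shows the general pattern of adding a position-dependent term to the attention scores before the softmax, so that the extra work is a table lookup per query-key pair.

```python
import math


def relative_bucket(delta, max_distance):
    """Clip a signed offset to [-max_distance, max_distance] and shift it to a table index."""
    return max(-max_distance, min(max_distance, delta)) + max_distance


def attention_with_2d_bias(scores, positions, bias_x, bias_y, max_distance):
    """Add a 2D relative bias to raw attention scores, then apply a row-wise softmax.

    scores:    n x n list of raw query-key scores
    positions: list of n (x, y) integer token coordinates on the page
    bias_x, bias_y: lookup tables of length 2 * max_distance + 1
                    (hypothetical; they would be learned in a real model)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        xi, yi = positions[i]
        row = []
        for j in range(n):
            xj, yj = positions[j]
            bias = (bias_x[relative_bucket(xj - xi, max_distance)]
                    + bias_y[relative_bucket(yj - yi, max_distance)])
            row.append(scores[i][j] + bias)
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: hand-written tables that penalise distance along each axis.
max_distance = 2
bias_x = [-float(abs(d)) for d in range(-max_distance, max_distance + 1)]
bias_y = [-float(abs(d)) for d in range(-max_distance, max_distance + 1)]
positions = [(0, 0), (1, 0), (5, 5)]       # two neighbouring tokens and a distant one
scores = [[0.0] * 3 for _ in range(3)]     # uniform content scores
weights = attention_with_2d_bias(scores, positions, bias_x, bias_y, max_distance)
```

With uniform content scores, the first token attends most to itself, then to its horizontal neighbour, and least to the distant token, which is the kind of guidance towards nearby tokens that the abstract describes.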

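Since their release, Transformers have revolutionized many fields, from natural language understanding to computer vision. However, the computational complexity of the self-attention operation limits their ability to handle long sequences. This paper explores multiple strategies for applying Transformer-based models to long multi-page documents. We introduce two new multi-modal (text + layout) long-range models based on efficient Transformer implementations for long sequences. Long-range models can effectively process a whole document at once and are less sensitive to document length. We compare them to LayoutLM, a classical Transformer adapted for document understanding and pre-trained on millions of documents. We further propose a 2D relative attention bias that guides self-attention towards relevant tokens without harming model efficiency. We observe improvements on information retrieval for multi-page business documents, at a small performance cost on shorter sequences. Relative 2D attention is effective on dense text for both normal and long-range models.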

Long-Range Transformer Architectures for Document Understanding

Due to increasing interest in adapting models on resource-constrained edge
devices, parameter-efficient transfer learning has been widely explored. Among
various methods, Visual Prompt Tuning (VPT), which prepends learnable prompts
to the input space, shows competitive fine-tuning performance compared to
training the full set of network parameters. However, VPT increases the number
of input tokens, resulting in additional computational overhead. In this paper, we analyze the
impact of the number of prompts on fine-tuning performance and self-attention
operation in a vision transformer architecture. Through theoretical and
empirical analysis we show that adding more prompts does not lead to linear
performance improvement. Further, we propose a Prompt Condensation (PC)
technique that aims to prevent performance degradation from using a small
number of prompts. We validate our methods on FGVC and VTAB-1k tasks and show
that our approach reduces the number of prompts by ~70% while maintaining
accuracy.
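
The overhead from extra prompt tokens can be illustrated with simple arithmetic. The sketch below counts only the entries of the query-key score matrix per head and per layer, which grows quadratically with the token count; it ignores projections and MLP layers. The helper `attention_score_entries` and the prompt counts (100 and 30) are hypothetical values chosen for illustration, not figures from the paper. The 196 patches correspond to a 224x224 image split into 16x16 patches.

```python
def attention_score_entries(num_patches, num_prompts, has_cls_token=True):
    """Number of query-key score entries per head and per layer.

    Prompts are prepended to the patch tokens, so they enlarge the sequence
    length on both the query and the key side.
    """
    tokens = num_patches + num_prompts + (1 if has_cls_token else 0)
    return tokens * tokens


num_patches = 196  # 14 x 14 patches for a 224x224 image with 16x16 patches

base = attention_score_entries(num_patches, 0)          # no prompts
full = attention_score_entries(num_patches, 100)        # hypothetical prompt budget
condensed = attention_score_entries(num_patches, 30)    # the same budget cut by 70%

print(base, full, condensed)  # prints "38809 88209 51529"
```

Under these assumed numbers, 100 prompts more than double the size of the score matrix, while 30 prompts add roughly a third, which is why reducing the prompt count matters for compute.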

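This paper studies how the number of prompts affects fine-tuning performance and the self-attention operation in a vision transformer architecture. Through theoretical and empirical analysis, we find that adding more prompts does not yield a linear performance improvement. We therefore propose the Prompt Condensation technique to prevent the performance degradation caused by using a small number of prompts; experiments show that our method reduces the number of prompts by about 70% while maintaining accuracy.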

Do We Really Need a Large Number of Visual Prompts?

Modelling long-range dependencies is critical for scene understanding tasks
in computer vision. Although convolutional neural networks (CNNs) have excelled
in many vision tasks, they are still limited in capturing long-range structured
relationships as they typically consist of layers of local kernels. A
fully-connected graph, such as the self-attention operation in Transformers, is
beneficial for such modelling; however, its computational overhead is
prohibitive. In this paper, we propose a dynamic graph message passing network
that significantly reduces the computational complexity compared to related
works modelling a fully-connected graph. This is achieved by adaptively
sampling nodes in the graph, conditioned on the input, for message passing.
Based on the sampled nodes, we dynamically predict node-dependent filter
weights and the affinity matrix for propagating information between them. This
formulation allows us to design a self-attention module and, more importantly, a
new Transformer-based backbone network that we use both for image
classification pretraining and for addressing various downstream tasks (object
detection, instance and semantic segmentation). Using this model, we show
significant improvements with respect to strong, state-of-the-art baselines on
four different tasks. Our approach also outperforms fully-connected graphs
while using substantially fewer floating-point operations and parameters. Code
and models will be made publicly available at
this https URL
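
The sampling idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' network: nodes sit on a 1D ring, `sample_nodes` is a hypothetical stand-in for a learned, input-conditioned sampler, the affinity is a softmax over dot products, and the node-dependent filter weights mentioned above are omitted. The point is only that each node exchanges messages with K sampled nodes instead of all N.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def sample_nodes(i, features, base_offsets):
    """Pick neighbours of node i: fixed base offsets shifted by an input-dependent amount.

    The shift is a crude stand-in for a learned offset predictor (an assumption).
    """
    n = len(features)
    shift = int(round(sum(features[i])))
    return [(i + off + shift) % n for off in base_offsets]


def message_passing(features, base_offsets):
    """One round of message passing over the sampled nodes, with a residual update."""
    updated = []
    for i, feat in enumerate(features):
        neighbours = sample_nodes(i, features, base_offsets)
        # Affinity between node i and each sampled node (softmax over dot products).
        logits = [dot(feat, features[j]) for j in neighbours]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        affinity = [e / total for e in exps]
        # Aggregate the sampled features, weighted by affinity.
        message = [sum(a * features[j][d] for a, j in zip(affinity, neighbours))
                   for d in range(len(feat))]
        updated.append([f + m for f, m in zip(feat, message)])
    return updated


features = [[float(i % 3), 1.0] for i in range(8)]  # 8 toy nodes, 2 channels each
base_offsets = [-2, 1, 3]                           # K = 3 sampled nodes per node
updated = message_passing(features, base_offsets)

pairs_sampled = len(features) * len(base_offsets)   # N * K node pairs
pairs_full = len(features) ** 2                     # N * N for a fully-connected graph
```

In this toy setting the sampled graph evaluates 24 node pairs where a fully-connected graph would evaluate 64, and the gap widens as the number of nodes grows while K stays fixed.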

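This paper proposes a dynamic graph message passing network for modelling long-range dependencies in image recognition. The network adaptively samples nodes and, during message passing, dynamically predicts node-dependent filter weights and the affinity matrix, which enables the design of a self-attention module. Results show that a Transformer backbone based on this model significantly improves over the existing state of the art on four different tasks, including image classification and object detection, while outperforming fully-connected graphs and using fewer floating-point operations and parameters.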