In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified
architecture for image and video classification, as well as object detection.
We present an improved version of MViT that incorporates decomposed relative
positional embeddings and residual pooling connections. We instantiate this
architecture in five sizes and evaluate it for ImageNet classification, COCO
detection and Kinetics video recognition where it outperforms prior work. We
further compare MViTv2's pooling attention to window attention mechanisms, where
it outperforms the latter in accuracy/compute. Without bells-and-whistles,
MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on
ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1%
on Kinetics-400 video classification. Code and models are available at
this https URL
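
The abstract names pooling attention and residual pooling connections without spelling them out. As a rough illustration only, the sketch below assumes a 1D token sequence, a single head, identity projections, and non-overlapping average pooling of keys and values; the residual pooling connection is modeled as adding the query back to the attention output. The function names (`avg_pool`, `pooling_attention`) and the `kv_stride` parameter are hypothetical and not taken from the released code, and the decomposed relative positional embeddings are left out.

```python
import math


def avg_pool(seq, stride):
    """Average non-overlapping groups of `stride` vectors (assumed pooling operator)."""
    pooled = []
    for start in range(0, len(seq), stride):
        group = seq[start:start + stride]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooling_attention(x, kv_stride=2, residual=True):
    """Single-head attention over pooled keys/values (identity projections assumed)."""
    queries = x                     # queries stay at full resolution in this sketch
    keys = avg_pool(x, kv_stride)   # fewer keys means a smaller attention matrix
    values = keys                   # identity projections: values equal pooled keys
    dim = len(x[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        z = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        if residual:
            # Residual pooling connection (assumed form): add the query back.
            z = [zd + qd for zd, qd in zip(z, q)]
        outputs.append(z)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
with_residual = pooling_attention(tokens, kv_stride=2)
without_residual = pooling_attention(tokens, kv_stride=2, residual=False)
```

With `kv_stride=2` the four tokens attend to only two pooled keys, so the attention matrix shrinks from 4x4 to 4x2; the residual term is what keeps full-resolution query information in the output.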

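This paper studies Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification and object detection. It proposes an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections, and applies it to ImageNet classification, COCO detection, and Kinetics video recognition, where it achieves strong performance. Experiments across the three domains indicate that, compared with window attention mechanisms, MViTv2's pooling attention is better at feature extraction and information encoding.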

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

In this paper, we introduce a two-level attention schema, Poolingformer, for
long document modeling. Its first level uses a smaller sliding window pattern
to aggregate information from neighbors. Its second level employs a larger
window to increase receptive fields with pooling attention to reduce both
computational cost and memory consumption. We first evaluate Poolingformer on
two long sequence QA tasks: the monolingual NQ and the multilingual TyDi QA.
Experimental results show that Poolingformer sits atop three official
leaderboards measured by F1, outperforming previous state-of-the-art models by
1.9 points (79.8 vs. 77.9) on NQ long answer, 1.9 points (79.5 vs. 77.6) on
TyDi QA passage answer, and 1.6 points (67.6 vs. 66.0) on TyDi QA minimal
answer. We further evaluate Poolingformer on a long sequence summarization
task. Experimental results on the arXiv benchmark continue to demonstrate its
superior performance.
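
The two-level schema described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single head, identity projections, mean pooling as the compression step, and that the second level simply runs on the first level's output. The names `two_level_attention`, `small_radius`, `large_radius`, and `pool_stride` are hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    dim = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(dim) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))]


def window(seq, center, radius):
    """Tokens within `radius` positions of `center`, clipped at the sequence edges."""
    return seq[max(0, center - radius):center + radius + 1]


def mean_pool(seq, stride):
    """Average non-overlapping groups of `stride` vectors (assumed pooling choice)."""
    pooled = []
    for start in range(0, len(seq), stride):
        group = seq[start:start + stride]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def two_level_attention(x, small_radius=1, large_radius=4, pool_stride=2):
    # Level 1: each token attends to its close neighbors in a small sliding window.
    level1 = []
    for i in range(len(x)):
        neighbors = window(x, i, small_radius)
        level1.append(attend(x[i], neighbors, neighbors))
    # Level 2: a larger window widens the receptive field; pooling its keys and
    # values keeps the number of attended positions small.
    level2 = []
    for i in range(len(level1)):
        pooled = mean_pool(window(level1, i, large_radius), pool_stride)
        level2.append(attend(level1[i], pooled, pooled))
    return level2


sequence = [[float(i), 1.0] for i in range(10)]
output = two_level_attention(sequence)
```

In this sketch a token in the middle of the sequence sees 9 positions in the second-level window but attends to only 5 pooled vectors, which is the source of the cost and memory savings the abstract refers to.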

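This paper introduces Poolingformer, a two-level attention schema for long document modeling. Its first level uses a smaller sliding window pattern to aggregate information from neighbors; its second level uses a larger window to increase the receptive field, with pooling attention to reduce computational cost and memory consumption. Experimental results show that Poolingformer leads existing state-of-the-art models on three leaderboards and performs strongly on both long-sequence QA and long-sequence summarization tasks.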