The multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for
computer vision tasks, but the self-attention computation in a Transformer
scales quadratically with the number of input patches. Thus, existing solutions
commonly employ down-sampling operations (e.g., average pooling) over
keys/values to dramatically reduce the computational cost. In this work, we
argue that such an over-aggressive down-sampling design is not invertible and
inevitably loses information, especially the high-frequency components
of objects (e.g., texture details). Motivated by wavelet theory, we
construct a new Wavelet Vision Transformer (Wave-ViT) that formulates
the invertible down-sampling with wavelet transforms and self-attention
learning in a unified way. This proposal enables self-attention learning with
lossless down-sampling over keys/values, facilitating the pursuit of a better
efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are
leveraged to strengthen self-attention outputs by aggregating local contexts
with an enlarged receptive field. We validate the superiority of Wave-ViT through
extensive experiments over multiple vision tasks (e.g., image recognition,
object detection and instance segmentation). Its performance surpasses that of
state-of-the-art ViT backbones with comparable FLOPs. Source code is available
at https://github.com/YehLi/ImageNetModel.

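This work builds the Wavelet Vision Transformer to address multi-scale vision problems. It uses wavelet transforms to achieve invertible down-sampling and combines local context information to improve the self-attention outputs. The results show strong performance on a variety of tasks such as image recognition.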
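
The contrast between pooling and invertible down-sampling described above can be illustrated with a minimal sketch. It assumes a single-level 2D Haar wavelet and plain NumPy; the abstract does not say which wavelet or which implementation Wave-ViT uses, so the function names `haar_dwt2` and `haar_idwt2` are hypothetical and this is not the authors' code. The sketch shows that the four half-resolution sub-bands together reconstruct the input exactly, while average pooling corresponds to keeping only the low-frequency sub-band.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns four (H/2, W/2) sub-bands: one low-frequency band and three
    high-frequency (detail) bands. Haar is an assumption made for illustration.
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # block average (scaled): low frequencies
    lh = (a - b + c - d) / 2.0  # difference between left and right columns
    hl = (a + b - c - d) / 2.0  # difference between top and bottom rows
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (H, W) array from the four sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.standard_normal((8, 8))

    bands = haar_dwt2(feature_map)
    restored = haar_idwt2(*bands)
    print("max reconstruction error:", np.abs(restored - feature_map).max())

    # 2x2 average pooling keeps only a scaled copy of the low-frequency band,
    # so the three detail bands are discarded and cannot be recovered.
    avg_pool = feature_map.reshape(4, 2, 4, 2).mean(axis=(1, 3))
    print("avg pool equals ll / 2:", np.allclose(avg_pool, bands[0] / 2.0))
```

In an attention layer, the four sub-bands could be stacked along the channel axis so that keys/values have a quarter of the spatial positions without discarding the detail bands; how Wave-ViT actually arranges them is not specified in the abstract.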

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

This paper presents an efficient multi-scale vision Transformer, called ResT,
that can capably serve as a general-purpose backbone for image recognition. Unlike
existing Transformer methods, which employ standard Transformer blocks to
tackle raw images with a fixed resolution, ResT has several advantages:
(1) A memory-efficient multi-head self-attention is built, which compresses the
memory with a simple depth-wise convolution and projects the interaction across
the attention-heads dimension while preserving the diversity of the
heads; (2) Position encoding is constructed as spatial attention, which
is more flexible and can handle input images of arbitrary size without
interpolation or fine-tuning; (3) Instead of the straightforward tokenization at
the beginning of each stage, we design the patch embedding as a stack of
overlapping strided convolution operations on the 2D-reshaped token map. We
comprehensively validate ResT on image classification and downstream tasks.
Experimental results show that the proposed ResT can outperform recent
state-of-the-art backbones by a large margin, demonstrating the potential of
ResT as a strong backbone. The code and models will be made publicly available
at this https URL

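This paper proposes an efficient multi-scale vision Transformer, named ResT, that can serve as a general-purpose backbone for image recognition. It offers several advantages that address the shortcomings of conventional Transformer models when handling raw images at a fixed resolution: in particular, a memory-efficient multi-head self-attention mechanism, a position encoding built as spatial attention, and a patch embedding designed as a series of overlapping convolution operations. Together these improve performance on image recognition and downstream tasks.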
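
The memory-efficient attention in point (1) can be sketched roughly as follows. This is a single-head NumPy illustration under stated assumptions, not the ResT implementation: the depth-wise filter uses non-overlapping windows (kernel size equal to stride), the cross-head projection mentioned in the abstract is omitted, and all names (`depthwise_downsample`, `reduced_attention`, `wq`, `wk`, `wv`) are hypothetical. It only shows how shrinking the key/value token map with a per-channel strided filter reduces the attention matrix from n x n to n x n/s^2.

```python
import numpy as np


def depthwise_downsample(tokens, h, w, kernel, stride):
    """Shrink an (h*w, c) token map with a per-channel strided filter.

    kernel has shape (stride, stride, c): each channel has its own weights
    and channels are never mixed, which is what makes the filter depth-wise.
    Assumes h and w are divisible by stride.
    """
    c = tokens.shape[1]
    grid = tokens.reshape(h // stride, stride, w // stride, stride, c)
    # Weighted sum over each non-overlapping stride x stride window, per channel.
    out = np.einsum("iajbc,abc->ijc", grid, kernel)
    return out.reshape(-1, c)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def reduced_attention(x, h, w, wq, wk, wv, kernel, stride):
    """Single-head attention whose keys/values come from a down-sampled map."""
    q = x @ wq                                   # (n, d): one query per token
    x_small = depthwise_downsample(x, h, w, kernel, stride)
    k = x_small @ wk                             # (n / stride**2, d)
    v = x_small @ wv                             # (n / stride**2, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
    return attn @ v, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = w = 8
    c = 16
    stride = 2
    x = rng.standard_normal((h * w, c))
    wq, wk, wv = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
    kernel = rng.standard_normal((stride, stride, c))

    out, attn = reduced_attention(x, h, w, wq, wk, wv, kernel, stride)
    print("output shape:", out.shape)
    print("attention shape:", attn.shape)
```

With `stride = 2` the score matrix has a quarter as many columns as full self-attention would need, which is where the memory saving comes from. Setting every kernel weight to `1 / stride**2` reduces the filter to plain average pooling.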