Transformers have risen to become state-of-the-art vision architectures
through innovations in attention mechanisms inspired by visual perception. At
present, two classes of attention prevail in vision transformers: regional and
sparse attention. The former bounds pixel interactions within a region; the
latter spreads them across sparse grids. Their opposing natures have
resulted in a dilemma between preserving hierarchical relations and
attaining a global context. In this work, taking inspiration from atrous
convolution, we introduce Atrous Attention, a fusion of regional and sparse
attention, which can adaptively consolidate both local and global information,
while maintaining hierarchical relations. As a further tribute to atrous
convolution, we redesign the ubiquitous inverted residual convolution blocks
with atrous convolution. Finally, we propose a generalized, hybrid vision
transformer backbone, named ACC-ViT, following conventional practices for
standard vision tasks. Our tiny model achieves $\sim 84\%$ accuracy on
ImageNet-1K with fewer than $28.5$ million parameters, a $0.42\%$
improvement over the state-of-the-art MaxViT with $8.4\%$ fewer parameters.
In addition, we investigate the efficacy of the ACC-ViT backbone under
different evaluation settings, such as finetuning, linear probing, and
zero-shot learning on tasks involving medical image analysis, object detection,
and language-image contrastive learning. ACC-ViT is therefore a strong vision
backbone that is also competitive in its mobile-scale versions, making it well
suited to niche applications with small datasets.
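
To make the contrast between regional and sparse attention concrete, the sketch below shows only how pixels might be grouped before attention is computed within each group. It is an illustrative assumption, not the ACC-ViT implementation: the function names `window_partition`, `grid_partition`, and `dilated_window_partition` are hypothetical, and the dilated variant is one plausible way to combine both groupings, not the paper's exact formulation.

```python
import numpy as np

def window_partition(x, p):
    """Regional grouping: split an (H, W) map into non-overlapping p x p windows."""
    H, W = x.shape
    # (H//p, p, W//p, p) -> (H//p, W//p, p, p) -> (num_windows, p*p)
    return x.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3).reshape(-1, p * p)

def grid_partition(x, g):
    """Sparse grouping: each group holds g x g pixels spaced H//g rows and W//g columns apart."""
    H, W = x.shape
    # (g, H//g, g, W//g) -> (H//g, W//g, g, g) -> (num_groups, g*g)
    return x.reshape(g, H // g, g, W // g).transpose(1, 3, 0, 2).reshape(-1, g * g)

def dilated_window_partition(x, p, r):
    """Hypothetical atrous-style grouping: p x p windows taken on a grid dilated by rate r."""
    # Each (i, j) offset selects every r-th pixel; windows are then formed on that sub-map.
    groups = [window_partition(x[i::r, j::r], p) for i in range(r) for j in range(r)]
    return np.concatenate(groups, axis=0)

x4 = np.arange(16).reshape(4, 4)
windows = window_partition(x4, 2)   # first group: neighbouring pixels 0, 1, 4, 5
grids = grid_partition(x4, 2)       # first group: strided pixels 0, 2, 8, 10

x8 = np.arange(64).reshape(8, 8)
dilated = dilated_window_partition(x8, 2, 2)  # first group: pixels 0, 2, 16, 18
```

In this toy layout, a window group covers a compact patch, a grid group spans the whole map, and the dilated group sits in between: its members are one pixel apart in each direction yet still confined to a region.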

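Through attention-mechanism innovations inspired by visual perception, Transformers have become state-of-the-art vision architectures. This paper introduces Atrous Attention, a fusion of regional and sparse attention that adaptively consolidates local and global information while maintaining hierarchical relations. It also proposes ACC-ViT, a general-purpose hybrid vision transformer backbone for standard vision tasks, whose mobile-scale versions suit niche applications with small datasets.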

ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

There is an ever-growing zoo of modern neural network models that can
efficiently learn end-to-end control from visual observations. These advanced
deep models, ranging from convolutional to patch-based networks, have been
extensively tested on offline image classification and regression tasks. In
this paper, we study these vision architectures with respect to the open-loop
to closed-loop causality gap, i.e., offline training followed by an online
closed-loop deployment. This causality gap typically emerges in robotics
applications such as autonomous driving, where a network is trained to imitate
the control commands of a human. In this setting, two situations arise: 1)
closed-loop testing in-distribution, where the test environment shares
properties with the offline training data; and 2) closed-loop testing under
distribution shifts and out-of-distribution. Contrary to recently reported
results, we show that under proper training guidelines, all vision models
perform indistinguishably well on in-distribution deployment, resolving the
causality gap. In situation 2, we observe that the causality gap disrupts
performance regardless of the choice of model architecture. Our results
imply that the causality gap can be solved in situation 1 with our proposed
training guideline and any modern network architecture, whereas achieving
out-of-distribution generalization (situation 2) requires further
investigation, for instance, into data diversity rather than model
architecture.
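
The difference between open-loop and closed-loop evaluation can be illustrated with a toy example. The sketch below is an assumption made purely for illustration and is unrelated to the paper's experiments: a hypothetical 1-D system, a hand-written `expert` controller, and a `learner` that imitates it with a small constant bias. Open-loop evaluation scores the learner's actions on states the expert visited; closed-loop evaluation lets the learner drive the system, so its own errors determine which states it sees.

```python
import numpy as np

def expert(x):
    # Hypothetical expert controller: steer the state x back toward zero.
    return -1.0 * x

def learner(x):
    # Hypothetical imitation policy: the expert plus a small constant bias.
    return -1.0 * x + 0.05

def step(x, u, dt=0.1):
    # Toy dynamics (an assumption): the state integrates the control command.
    return x + dt * u

def rollout(policy, x0=1.0, steps=200):
    # Closed loop: each action changes the next state the policy observes.
    xs = [x0]
    for _ in range(steps):
        xs.append(step(xs[-1], policy(xs[-1])))
    return np.array(xs)

# Open-loop evaluation: compare actions on the expert's own trajectory.
expert_states = rollout(expert)
open_loop_mse = float(np.mean((learner(expert_states) - expert(expert_states)) ** 2))

# Closed-loop evaluation: the learner controls the system itself.
learner_states = rollout(learner)
closed_loop_dev = float(np.max(np.abs(learner_states - expert_states)))

print(f"open-loop action MSE:       {open_loop_mse:.4f}")
print(f"closed-loop state deviation: {closed_loop_dev:.4f}")
```

In this stable toy system the learner's bias settles into a constant offset from the expert's trajectory. The two metrics measure different things (action error versus state deviation), which is the distinction the abstract draws between offline testing and closed-loop deployment.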

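This paper studies how modern neural network models perform across the causality gap between offline training and online closed-loop deployment in robotics applications. It finds that, under proper training guidelines, all vision architectures perform equally well in in-distribution deployment, whereas under distribution shift performance degrades regardless of the model chosen, so further research should target data diversity rather than model architecture.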