TL;DR: This paper proposes a novel shift-invariant local attention layer, called query and attend (QnA), incorporates it into hierarchical vision transformer models, and shows that it improves speed and memory complexity while achieving accuracy comparable to state-of-the-art models.
Abstract
Vision transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies.
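
To make the core idea concrete, below is a minimal sketch of a local attention layer with learned queries, in the spirit of QnA. This is an illustrative assumption, not the authors' implementation: the class name, window size, head count, and stride-1 window extraction are hypothetical choices. Because a single learned query is shared across all windows, the same parameters are applied at every spatial location, which makes the layer shift-invariant by construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnedQueryLocalAttention(nn.Module):
    """Local attention where the query is a learned parameter shared
    across all windows (a QnA-style layer; details are assumptions)."""

    def __init__(self, dim, window=3, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.window = heads, window
        self.head_dim = dim // heads
        # One learned query per head, shared across every window.
        self.query = nn.Parameter(torch.randn(heads, self.head_dim))
        # Keys and values are still computed from the input.
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        k, v = self.to_kv(x).chunk(2, dim=-1)

        def windows(t):
            # Extract overlapping window x window patches ('same' padding).
            t = t.permute(0, 3, 1, 2)                        # (B, C, H, W)
            t = F.unfold(t, self.window, padding=self.window // 2)
            t = t.view(B, C, self.window ** 2, H * W)        # (B, C, w*w, L)
            return t.permute(0, 3, 2, 1).reshape(
                B, H * W, self.window ** 2, self.heads, self.head_dim)

        k, v = windows(k), windows(v)
        # Score the shared learned query against the keys of each window.
        attn = torch.einsum('hd,bnwhd->bnwh', self.query, k)
        attn = (attn * self.head_dim ** -0.5).softmax(dim=2)
        # Aggregate values per window, then merge heads back to C channels.
        out = torch.einsum('bnwh,bnwhd->bnhd', attn, v)
        return self.proj(out.reshape(B, H, W, C))


# Usage example with hypothetical dimensions:
layer = LearnedQueryLocalAttention(dim=64)
y = layer(torch.randn(2, 14, 14, 64))  # -> (2, 14, 14, 64)
```

Note the design consequence that motivates the TL;DR's efficiency claim: since queries are learned parameters rather than projections of the input, no per-pixel query tensor has to be computed or stored, trimming both compute and memory relative to full self-attention over the same windows.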