Post-hoc explanation methods have often been criticised for abstracting away
the decision-making process of deep neural networks. In this work, we aim to
provide natural language descriptions of what different layers of a
vision backbone have learned. Our DeViL method decodes vision features into
language, not only highlighting the attribution locations but also generating
textual descriptions of visual features at different layers of the network. We
train a transformer network to translate individual image features of any
vision layer into a prompt that a separate off-the-shelf language model decodes
into natural language. By employing dropout both per-layer and
per-spatial-location, our model can generalize training on image-text pairs to
generate localized explanations. As it uses a pre-trained language model, our
approach is fast to train, can be applied to any vision backbone, and produces
textual descriptions at different layers of the vision network. Moreover, DeViL
can create open-vocabulary attribution maps corresponding to words or phrases
even outside the training scope of the vision model. We demonstrate that DeViL
generates textual descriptions relevant to the image content on CC3M, surpassing
previous lightweight captioning models, and produces attribution maps that uncover the
learned concepts of the vision backbone. Finally, we show that DeViL also
outperforms the current state-of-the-art on the neuron-wise descriptions of the
MILANNOTATIONS dataset. Code available at
this https URL

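We propose DeViL, a method that complements post-hoc explanation methods by providing natural language descriptions of the decision-making process of deep neural networks. By decoding vision features into language, it highlights the attribution locations of visual features at different layers of the network and, by translating from image features to text, generates textual descriptions for different layers of the vision network.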

DeViL: Decoding Vision features into Language
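
The abstract states that dropout is applied both per layer and per spatial location so that training on whole image-text pairs generalizes to localized explanations. The following is a minimal sketch of that idea, not the authors' implementation: the function name `sample_kept_tokens`, the drop probabilities, the order of the two dropout steps, and the fallback that keeps one token are all assumptions made for illustration.

```python
import random


def sample_kept_tokens(layer_sizes, p_layer=0.5, p_location=0.5, rng=None):
    """Return the (layer, location) index pairs that survive dropout.

    layer_sizes[i] is the number of spatial locations in vision layer i.
    Hypothetical scheme: drop whole layers first, then drop individual
    locations inside the surviving layers. If everything was dropped,
    keep one random token so the translator always receives an input.
    """
    if rng is None:
        rng = random.Random()
    kept = []
    for layer, n_locations in enumerate(layer_sizes):
        if rng.random() < p_layer:          # per-layer dropout
            continue
        for loc in range(n_locations):
            if rng.random() >= p_location:  # per-spatial-location dropout
                kept.append((layer, loc))
    if not kept:
        layer = rng.randrange(len(layer_sizes))
        kept.append((layer, rng.randrange(layer_sizes[layer])))
    return kept


# Training time: the translator sees a random subset of layers and locations.
train_tokens = sample_kept_tokens([49, 196, 784], rng=random.Random(0))

# Inference time: expose one location of one layer to request a
# description of just that feature (the indices here are arbitrary).
single_location = [(1, 42)]
```

Because the translator is trained on arbitrary subsets of features, feeding it a single location of a single layer at inference time stays within the range of inputs it has seen, which is one plausible reading of how the localized descriptions become possible.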

This paper studies how to keep a vision backbone effective while removing
token mixers in its basic building blocks. Token mixers, such as self-attention in
vision transformers (ViTs), are intended to exchange information
between different spatial tokens but suffer from considerable computational
cost and latency. However, directly removing them leads to an incomplete
model structure prior and thus causes a significant accuracy drop. To this
end, we first develop RepIdentityFormer, based on the re-parameterization idea,
to study token-mixer-free model architectures. We then explore an
improved learning paradigm to overcome the limitations of a simple token-mixer-free
backbone, and summarize the empirical practice into five guidelines. Equipped with
the proposed optimization strategy, we are able to build an extremely simple
vision backbone with encouraging performance and high
efficiency during inference. Extensive experiments and ablation analysis also
demonstrate that the inductive bias of a network architecture can be
incorporated into a simple network structure with an appropriate optimization
strategy. We hope this work can serve as a starting point for the exploration
of optimization-driven efficient network design. Project page:
this https URL

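This paper studies how to keep a vision backbone effective while removing the token mixers from its basic building blocks, and proposes a practical optimization strategy that makes it possible to build an extremely simple vision backbone with encouraging performance and high efficiency during inference.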

RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer
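
The abstract refers to "the re-parameterizing idea" without spelling it out. As a generic illustration of structural re-parameterization, the sketch below folds a per-channel affine operator that follows a normalization layer into that layer's own scale and shift, so the extra operator costs nothing at inference. Which operators RIFormer actually folds is not stated in the abstract; the helper names and all numeric values here are hypothetical.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector, then apply per-channel scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def affine(x, scale, shift):
    """Extra per-channel affine operator used only during training."""
    return [s * v + t for v, s, t in zip(x, scale, shift)]


def fold_affine_into_norm(gamma, beta, scale, shift):
    """Merge the affine operator into the normalization parameters.

    scale * (gamma * z + beta) + shift
        == (scale * gamma) * z + (scale * beta + shift)
    """
    folded_gamma = [s * g for s, g in zip(scale, gamma)]
    folded_beta = [s * b + t for s, b, t in zip(scale, beta, shift)]
    return folded_gamma, folded_beta


# Illustrative values only.
x = [0.5, -1.0, 2.0, 0.25]
gamma, beta = [1.0, 0.5, 2.0, 1.5], [0.0, 0.1, -0.2, 0.3]
scale, shift = [0.9, 1.1, 1.0, 0.8], [0.05, 0.0, -0.1, 0.2]

# Training-time structure: normalization followed by a separate affine.
train_time = affine(layer_norm(x, gamma, beta), scale, shift)

# Inference-time structure: a single normalization with folded parameters.
folded_gamma, folded_beta = fold_affine_into_norm(gamma, beta, scale, shift)
inference_time = layer_norm(x, folded_gamma, folded_beta)
```

The two structures compute the same function up to floating-point rounding, which is the property that lets a model train with additional parameters and then drop them for deployment.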

This paper presents a simple and effective approach to solving the
multi-label classification problem. The proposed approach leverages Transformer
decoders to query the existence of a class label. The use of Transformers is
rooted in the need to adaptively extract local discriminative features for
different labels, a strongly desired property because one image can contain
multiple objects. The built-in cross-attention module in the
Transformer decoder offers an effective way to use label embeddings as queries
to probe and pool class-related features from a feature map computed by a
vision backbone for subsequent binary classifications. Compared with prior
work, the new framework is simple, using standard Transformers and vision
backbones, and effective, consistently outperforming all previous work on five
multi-label classification datasets, including MS-COCO, PASCAL VOC, NUS-WIDE,
and Visual Genome. In particular, we achieve $91.3\%$ mAP on MS-COCO. We hope
its compact structure, simple implementation, and superior performance serve as
a strong baseline for multi-label classification tasks and future studies. The
code will be available soon at this https URL

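This paper proposes a simple and effective approach to multi-label classification. It uses Transformer decoders to query the existence of class labels and pools features from a feature map computed by a vision backbone for subsequent binary classification. Compared with prior work, the approach is simpler and more effective, consistently outperforming all previous work on five multi-label classification datasets, including MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome, and achieving 91.3% mAP on MS-COCO.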
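
The mechanism described above, label embeddings used as queries that pool class-related features from a backbone feature map before per-label binary classification, can be sketched as follows. This is a simplified single-head cross-attention in plain Python, not the authors' code: it omits the learned projections, multiple heads, and stacked decoder layers of a standard Transformer decoder, and every name and number in it is an assumption chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pool_label_features(label_queries, feature_map):
    """Each label query attends over all spatial locations of the map."""
    dim = len(feature_map[0])
    pooled = []
    for query in label_queries:
        weights = softmax([dot(query, f) / math.sqrt(dim) for f in feature_map])
        pooled.append([
            sum(w * f[d] for w, f in zip(weights, feature_map))
            for d in range(dim)
        ])
    return pooled


def label_probabilities(pooled, weights, biases):
    """One independent sigmoid per label: a binary decision per class."""
    return [
        1.0 / (1.0 + math.exp(-(dot(w, p) + b)))
        for p, w, b in zip(pooled, weights, biases)
    ]


# Toy feature map: three spatial locations with two channels each.
feature_map = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Two hypothetical label embeddings used as queries.
label_queries = [[10.0, 0.0], [0.0, 10.0]]

pooled = pool_label_features(label_queries, feature_map)
probabilities = label_probabilities(
    pooled, weights=[[4.0, 0.0], [0.0, 4.0]], biases=[-2.0, -2.0]
)
```

Each label gets its own pooled feature vector, so two objects at different locations in one image can each drive a different label's decision, which matches the motivation given in the abstract for extracting features adaptively per label.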