In fine-grained scene image classification, most previous works place heavy emphasis on global visual features when fusing multi-modal features. In other words, these models are deliberately designed around prior intuitions about the relative importance of different modalities. In this paper, we present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter), which aims to let the model adaptively learn the importance of each modality in different cases, without encoding a prior in the model architecture. More specifically, we eliminate the distributional differences between modalities and then use a modality-agnostic Transformer encoder for semantic-level feature fusion. Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks while using the same modalities as previous methods. Moreover, new modalities can be easily added to MAA to further boost performance. Code is available at https://github.com/quniLcs/MAA.
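
To make the two-step idea concrete, the following is a minimal NumPy sketch rather than the released implementation. It assumes that the distribution gap between modalities can be reduced by a per-modality linear projection followed by layer normalization, and it stands in for the Transformer encoder with a single self-attention layer whose weights are shared by all tokens. The modality names, feature widths, and random weights are hypothetical and serve only as illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared embedding width (hypothetical choice)


def layer_norm(x, eps=1e-6):
    """Normalize each token (row) to zero mean and roughly unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical modality features with different widths, offsets, and scales.
features = {
    "global_visual": rng.normal(0.0, 1.0, size=(1, 16)),
    "local_visual": rng.normal(5.0, 10.0, size=(3, 12)),
    "text": rng.normal(-2.0, 0.1, size=(2, 6)),
}

# Step 1 (assumption): project each modality to the shared width, then
# normalize, so that all tokens follow a comparable distribution.
projections = {
    name: rng.normal(size=(f.shape[1], D)) / np.sqrt(f.shape[1])
    for name, f in features.items()
}
tokens = np.concatenate(
    [layer_norm(f @ projections[name]) for name, f in features.items()],
    axis=0,
)

# Step 2: one attention layer shared by every token. No modality embedding
# is added, so the layer has no built-in preference for any modality.
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
attended, attention = self_attention(tokens, w_q, w_k, w_v)

# Residual connection followed by mean pooling gives one fused vector.
fused = (tokens + attended).mean(axis=0)
```

In this sketch, adding a modality only requires one more entry in `features` (and its projection); the shared attention layer is unchanged, which mirrors the extensibility claimed above.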

Modality-Agnostic Adapter for Fine-Grained Scene Image Classification