Many adaptations of transformers have emerged to address the single-modal
vision tasks, where self-attention modules are stacked to handle input sources
like images. Intuitively, feeding multiple modalities of data to vision
transformers could improve the performance, yet the inner-modal attentive
weights may also be diluted, which could thus undermine the final performance.
In this paper, we propose a multimodal token fusion method (TokenFusion),
tailored for transformer-based vision tasks. To effectively fuse multiple
modalities, TokenFusion dynamically detects uninformative tokens and
substitutes these tokens with projected and aggregated inter-modal features.
Residual positional alignment is also adopted to enable explicit utilization of
the inter-modal alignments after fusion. The design of TokenFusion allows the
transformer to learn correlations among multimodal features, while the
single-modal transformer architecture remains largely intact. Extensive
experiments are conducted on a variety of homogeneous and heterogeneous
modalities and demonstrate that TokenFusion surpasses state-of-the-art methods
in three typical vision tasks: multimodal image-to-image translation, RGB-depth
semantic segmentation, and 3D object detection with point cloud and images. Our
code is available at this https URL

本文提出了一个针对基于 Transformer 的视觉任务的多模态令牌融合方法（TokenFusion），可以在保持单模态 Transformer 结构基本不变的同时，学习多模态特征之间的相关性，并超越三个典型视觉任务中的最先进方法。