Vision-and-language (VL) pre-training has proven to be highly effective on
various VL downstream tasks. While recent work has shown that fully
transformer-based VL models can be more efficient than previous
region-feature-based methods, their performance on downstream tasks often
degrades significantly. In this paper, we present METER, a Multimodal
End-to-end TransformER framework, through which we investigate how to design
and pre-train a fully transformer-based VL model in an end-to-end manner.
Specifically, we dissect the model designs along multiple dimensions: vision
encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa,
DeBERTa), multimodal fusion modules (e.g., merged attention vs. co-attention),
architectural designs (e.g., encoder-only vs. encoder-decoder), and pre-training
objectives (e.g., masked image modeling). We conduct comprehensive experiments
and provide insights on how to train a performant VL transformer. METER
achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images
for pre-training, surpassing the state-of-the-art region-feature-based model by
1.04 percentage points and outperforming the previous best fully
transformer-based model by 1.6 percentage points. Notably, when further
scaled up, our best VQA model achieves an accuracy
of 80.54%. Code and pre-trained models are released at
this https URL
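
To make the fusion-module comparison above concrete, the following is a minimal sketch of the two options, not the released METER code. It assumes single-head attention with no learned projections, residual connections, layer normalization, or feed-forward layers; the function names (`attention`, `merged_attention`, `co_attention`) and the toy shapes are illustrative only. Merged attention concatenates text and image features and runs self-attention over the joint sequence, while co-attention keeps two streams and lets each one cross-attend to the other.

```python
import numpy as np

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

def merged_attention(text, image):
    """Concatenate both modalities, then self-attend over the joint sequence."""
    joint = np.concatenate([text, image], axis=0)
    fused = attention(joint, joint, joint)
    # Split the fused sequence back into its text and image parts.
    return fused[: len(text)], fused[len(text):]

def co_attention(text, image):
    """Separate streams: self-attention per modality, then cross-attention."""
    text_self = attention(text, text, text)
    image_self = attention(image, image, image)
    # Each stream queries the other stream's features (assumed ordering).
    text_out = attention(text_self, image_self, image_self)
    image_out = attention(image_self, text_self, text_self)
    return text_out, image_out

# Toy inputs: 5 text tokens and 9 image patches, both with 16-dim features.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(5, 16))
image_patches = rng.normal(size=(9, 16))

t_m, v_m = merged_attention(text_tokens, image_patches)
t_c, v_c = co_attention(text_tokens, image_patches)
print(t_m.shape, v_m.shape, t_c.shape, v_c.shape)
```

In a full model the merged variant shares one set of parameters across both modalities, whereas the co-attention variant keeps separate parameters per stream; this sketch omits parameters entirely and only shows which tokens attend to which.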

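Summary: This work presents METER, a Multimodal End-to-end TransformER framework, and investigates how to design and pre-train a fully transformer-based vision-and-language model. By dissecting the model design along multiple dimensions and pre-training the resulting model end to end, METER outperforms region-feature-based models: it reaches 77.64% accuracy on the VQAv2 test-std set, surpassing the previous best models, and its best model reaches 80.54% accuracy when further scaled up.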