Building scalable vision-language models to learn from diverse, multimodal
data remains an open challenge. In this paper, we introduce EVE, an Efficient
Vision-languagE foundation model: a single unified multimodal Transformer
pre-trained solely with one unified pre-training task. Specifically,
EVE encodes both vision and language within a shared Transformer network
integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which
capture modality-specific information by selectively switching to different
experts. To unify pre-training tasks of vision and language, EVE performs
masked signal modeling on image-text pairs to reconstruct masked signals, i.e.,
image pixels and text tokens, given visible signals. This simple yet effective
pre-training objective accelerates training by 3.5x compared to the model
pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing
to the combination of the unified architecture and pre-training task, EVE is
easy to scale up, achieving better downstream performance with fewer resources
and faster training. Despite its simplicity, EVE achieves
state-of-the-art performance on various vision-language downstream tasks,
including visual question answering, visual reasoning, and image-text
retrieval.
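
To make the idea of modality-aware expert selection concrete, the following is a minimal sketch and not EVE's actual implementation. It assumes a top-1 router whose logits are a linear score of the token plus a per-modality bias; the names `route_token`, `moe_layer`, `router_weights`, and `modality_bias` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, modality, router_weights, modality_bias):
    """Return (expert_index, gate) for one token using top-1 routing.

    Assumption: the router adds a modality-specific bias to each expert's
    logit, so image and text tokens tend to select different experts.
    """
    logits = [
        sum(w * x for w, x in zip(row, token)) + modality_bias[modality][i]
        for i, row in enumerate(router_weights)
    ]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def moe_layer(tokens, modalities, router_weights, modality_bias, experts):
    """Send each token to its selected expert and scale the output by the gate."""
    outputs = []
    for token, modality in zip(tokens, modalities):
        expert, gate = route_token(token, modality, router_weights, modality_bias)
        outputs.append([gate * v for v in experts[expert](token)])
    return outputs


# Toy setup: two experts, two-dimensional tokens (all values are illustrative).
router_weights = [[0.0, 0.0], [0.0, 0.0]]
modality_bias = {"image": [5.0, 0.0], "text": [0.0, 5.0]}
experts = [
    lambda t: [2.0 * v for v in t],   # hypothetical "vision" expert
    lambda t: [-1.0 * v for v in t],  # hypothetical "language" expert
]

tokens = [[1.0, 2.0], [1.0, 2.0]]
outputs = moe_layer(tokens, ["image", "text"], router_weights, modality_bias, experts)
```

In this toy setup the same token vector is handled by a different expert depending only on its modality tag, which is the behavior the abstract describes as selectively switching to different experts.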

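This paper introduces EVE, an efficient vision-language foundation model that is pre-trained with a single unified pre-training task. EVE encodes both vision and language within a shared Transformer network and uses modality-aware sparse Mixture-of-Experts modules to capture modality-specific information. By performing masked signal modeling on image-text pairs, EVE reconstructs masked image pixels and text tokens, which yields faster training and better downstream performance.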
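
The masked signal modeling objective can be sketched as a single loss over both modalities. The code below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes mean squared error on the pixels of masked image patches, cross-entropy on masked text tokens, and an unweighted sum of the two. The function names and the masking ratios are hypothetical.

```python
import math
import random


def random_mask(length, ratio, rng):
    """Pick positions to hide: about length * ratio of them, at least one."""
    count = min(length, max(1, round(length * ratio)))
    return sorted(rng.sample(range(length), count))


def masked_pixel_loss(patches, predictions, masked):
    """Mean squared error over the pixels of masked patches only."""
    errors = [
        (p - t) ** 2
        for i in masked
        for p, t in zip(predictions[i], patches[i])
    ]
    return sum(errors) / len(errors)


def masked_token_loss(token_ids, logits, masked):
    """Mean cross-entropy over masked text positions only."""
    total = 0.0
    for i in masked:
        m = max(logits[i])
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits[i]))
        total += log_norm - logits[i][token_ids[i]]
    return total / len(masked)


def masked_signal_loss(patches, pixel_preds, image_mask,
                       token_ids, token_logits, text_mask):
    """Assumed combination: unweighted sum of the image and text terms."""
    return (masked_pixel_loss(patches, pixel_preds, image_mask)
            + masked_token_loss(token_ids, token_logits, text_mask))


# Toy data: 4 image patches of 2 pixels each, 3 text tokens, vocabulary of 2.
rng = random.Random(0)
patches = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
pixel_preds = [[0.0, 0.0] for _ in patches]    # stand-in for model output
token_ids = [0, 1, 1]
token_logits = [[0.0, 0.0] for _ in token_ids]  # stand-in for model output

image_mask = random_mask(len(patches), 0.75, rng)  # ratio is illustrative
text_mask = random_mask(len(token_ids), 0.5, rng)  # ratio is illustrative
loss = masked_signal_loss(patches, pixel_preds, image_mask,
                          token_ids, token_logits, text_mask)
```

Only masked positions contribute to either term, so the model is trained to reconstruct hidden signals from the visible ones, as in the objective described above.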