This paper presents a Multitask Multilingual Multimodal Pre-trained model (M3P) that combines multilingual-monomodal pre-training and monolingual-multimodal pre-training into a unified framework via multitask learning and weight sharing. The model learns universal representations that map objects occurring in different modalities, or expressed in different languages, to vectors in a common semantic space. To verify the generalization capability of M3P, we fine-tune the pre-trained model for different types of downstream tasks: multilingual image-text retrieval, multilingual image captioning, multimodal machine translation, multilingual natural language inference, and multilingual text generation. Evaluation shows that M3P can (i) achieve results comparable to the state-of-the-art models pre-trained separately for multilingual tasks and for English multimodal tasks, and (ii) obtain new state-of-the-art results on non-English multimodal tasks in the zero-shot or few-shot setting. We also build a new Multilingual Image-Language Dataset (MILD) by collecting large amounts of (text-query, image, context) triplets in 8 languages from the logs of a commercial search engine.

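M3P is a multitask, multilingual, multimodal pre-trained model that combines multilingual pre-training and multimodal pre-training in a unified framework through multitask pre-training. The model aims to learn universal representations that map objects appearing in different modalities or in different languages into a common semantic space. The paper also proposes a training strategy called Multimodal Code-switched Training (MCT), which combines monolingual pre-training and multimodal pre-training through code-switching in order to explicitly encourage fine-grained alignment between images and non-English languages. Experiments were conducted on multilingual image retrieval across two benchmark datasets, MSCOCO and Multi30K. M3P achieves comparable results on English and new state-of-the-art results on non-English languages.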
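The summary above does not spell out how MCT builds its code-switched inputs, so the following is only a minimal sketch of the general idea of code-switching an English caption. It assumes word-level replacement from bilingual dictionaries with a fixed switch probability. The `code_switch` function, the `TOY_DICTIONARIES` table, and the `switch_prob` parameter are all hypothetical names introduced for illustration and are not taken from the paper.

```python
import random

# Hypothetical toy bilingual dictionaries (English -> target language).
# These entries are illustrative assumptions, not data from the paper.
TOY_DICTIONARIES = {
    "de": {"dog": "Hund", "ball": "Ball", "red": "roten"},
    "fr": {"dog": "chien", "ball": "ballon", "red": "rouge"},
}


def code_switch(tokens, dictionaries, switch_prob=0.5, rng=None):
    """Return a copy of `tokens` in which some English words are replaced
    by a translation drawn from a randomly chosen dictionary.

    Assumption: each token is switched independently with probability
    `switch_prob`; tokens with no dictionary entry are left unchanged.
    """
    if rng is None:
        rng = random.Random()
    switched = []
    for token in tokens:
        key = token.lower()
        # Collect every available translation of this token.
        candidates = [d[key] for d in dictionaries.values() if key in d]
        if candidates and rng.random() < switch_prob:
            switched.append(rng.choice(candidates))
        else:
            switched.append(token)
    return switched


if __name__ == "__main__":
    caption = "a dog chases a red ball".split()
    # The mixed-language caption would then be paired with the original image.
    print(" ".join(code_switch(caption, TOY_DICTIONARIES, rng=random.Random(0))))
```

In a sketch like this, the image stays fixed while the caption mixes languages, which is one plausible way to expose a model to image and non-English word pairs without non-English captions.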

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training