Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models (VLMs) that support multiple languages, e.g., both Chinese and English, have lagged behind due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enabling multimodal foundation models to understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands and yields a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B. The resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest model, $M^2$-Encoder-10B, achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, and we are making it available to the research community for further exploration and development.

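We introduce BM-6B, a bilingual (Chinese-English) dataset containing 6 billion image-text pairs. To handle a dataset of this scale, we propose a novel grouped aggregation method that greatly reduces communication overhead and GPU memory demands and thereby increases training speed. We pretrain a series of bilingual image-text foundation models on BM-6B with improved visual and textual understanding. These models set new benchmarks on multimodal retrieval and classification tasks, and under a zero-shot classification setting our largest model surpasses previously reported SoTA methods in top-1 accuracy by 2.2% on ImageNet and 21.1% on ImageNet-CN.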

M^2-Encoder: Advancing Bilingual Image-Text Understanding through Large-Scale Efficient Pretraining