While vision-language pre-trained models (VL-PTMs) have advanced multimodal
research in recent years, their proficiency in only a few languages such as
English restricts their applicability in broader communities. To this end, there
is increasing interest in developing multilingual VL models via a joint-learning
setup, which, however, can be unrealistic due to high costs and limited data
availability. In this work, we propose to extend VL-PTMs' language capacity by
continual language learning (CLL), where a model needs to update its linguistic
knowledge incrementally without suffering from catastrophic forgetting (CF). We
begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP,
a prevailing VL-PTM that has acquired image-English text alignment.
Specifically, CLL-CLIP contains an expandable token embedding layer to handle
linguistic differences. It solely trains token embeddings to improve memory
stability and is optimized under cross-modal and cross-lingual objectives to
learn the alignment between images and multilingual texts. To alleviate CF
caused by covariate shift and lexical overlap, we further propose a novel
approach that ensures the identical distribution of all token embeddings during
initialization and regularizes token embedding learning during training. We
construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600
datasets and then evaluate multilingual image-text retrieval performance.
Extensive experiments verify the effectiveness of CLL-CLIP and show that our
approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on
XM3600, and improve various state-of-the-art methods consistently. Our code and
data are available at https://github.com/yangbang18/CLFM.
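
The expandable token embedding layer described above can be illustrated with a small sketch. It appends rows for a new language's tokens and draws them from a Gaussian fitted to the existing table, which is one simple way to keep old and new embeddings identically distributed at initialization. This is an assumption made for illustration: the abstract does not specify the initialization procedure, and the function name `expand_embeddings` is hypothetical rather than taken from the released code.

```python
import random
import statistics


def expand_embeddings(table, num_new, seed=0):
    """Return a new table with `num_new` extra rows appended.

    Assumption (not from the paper): new rows are sampled from a Gaussian
    whose mean and standard deviation match the values already in `table`,
    so old and new token embeddings share the same distribution.
    """
    rng = random.Random(seed)
    values = [v for row in table for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    dim = len(table[0])
    new_rows = [[rng.gauss(mean, std) for _ in range(dim)]
                for _ in range(num_new)]
    # Existing rows are kept unchanged; only new rows are added.
    return table + new_rows


# Toy vocabulary of 3 tokens with 2-dimensional embeddings.
old_table = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]]
new_table = expand_embeddings(old_table, num_new=2)
print(len(old_table), len(new_table))
```

In a setup like the one the abstract describes, only this embedding table would receive gradient updates while the rest of the text and image encoders stay frozen.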

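This work extends the language capacity of vision-language pre-trained models (VL-PTMs) through continual language learning (CLL) and proposes the CLL-CLIP model, which trains only token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. Experiments show that the method is effective for multilingual image-text retrieval.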
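
The retrieval metric cited in the abstract, text-to-image Recall@1, can be sketched as follows. The sketch assumes a square similarity matrix in which text query `i` has exactly one correct image at index `i`; real benchmarks may pair several captions with each image, and the function name `recall_at_k` is illustrative only.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose correct image ranks in the top k.

    similarity[i][j] is the score of text query i against image j.
    Assumption: the ground-truth image for query i is image i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Image indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)),
                        key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Three text queries scored against three images.
sim = [
    [0.9, 0.1, 0.0],  # query 0: correct image ranked first
    [0.2, 0.1, 0.7],  # query 1: correct image ranked last
    [0.1, 0.2, 0.8],  # query 2: correct image ranked first
]
print(recall_at_k(sim, k=1))
```

An "average" Recall@1 across languages would then be the mean of this value computed separately for each language's captions, again assuming that is how the averaging is defined.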

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Vision-and-language pre-training has achieved impressive success in learning
multimodal representations between vision and language. To generalize this
success to non-English languages, we introduce UC2, the first machine
translation-augmented framework for cross-lingual cross-modal representation
learning. To tackle the scarcity problem of multilingual captions for image
datasets, we first augment existing English-only datasets with other languages
via machine translation (MT). Then we extend the standard Masked Language
Modeling and Image-Text Matching training objectives to a multilingual setting,
where alignment between different languages is captured through shared visual
context (i.e., using the image as a pivot). To facilitate the learning of a joint
embedding space of images and all languages of interest, we further propose two
novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and
Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated
data. Evaluation on multilingual image-text retrieval and multilingual visual
question answering benchmarks demonstrates that our proposed framework achieves
new state-of-the-art on diverse non-English benchmarks while maintaining
comparable performance to monolingual pre-trained models on English tasks.
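
The machine-translation augmentation step described above can be sketched in a few lines. Each English caption is translated into the target languages, and all versions stay attached to the same image, which is what lets the image act as a pivot between languages. The record layout, the function name `augment_with_translations`, and the `toy_translate` placeholder are assumptions for illustration; a real pipeline would call an actual MT system.

```python
def augment_with_translations(samples, translate, languages):
    """Attach machine-translated captions to each English-only sample.

    samples:   list of {"image_id": ..., "caption": <English text>}
    translate: callable (text, target_language) -> translated text
    languages: target language codes to add
    """
    augmented = []
    for sample in samples:
        captions = {"en": sample["caption"]}
        for lang in languages:
            captions[lang] = translate(sample["caption"], lang)
        # All captions share one image_id, so the image is the pivot.
        augmented.append({"image_id": sample["image_id"],
                          "captions": captions})
    return augmented


def toy_translate(text, lang):
    """Placeholder for a real MT system; it only tags the text."""
    return f"[{lang}] {text}"


data = [{"image_id": 1, "caption": "a dog on a beach"}]
result = augment_with_translations(data, toy_translate, ["de", "zh"])
print(result[0]["captions"])
```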

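UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning. We augment existing English-only datasets with image captions in other languages via machine translation, then extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). We further propose two new pre-training tasks, Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), to facilitate learning a joint embedding space of images and all languages of interest. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks shows that the proposed framework achieves new state-of-the-art results on diverse non-English benchmarks while maintaining performance comparable to monolingual pre-trained models on English tasks.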