Most vision-and-language pretraining research focuses on English tasks.
However, the emergence of multilingual multimodal evaluation benchmarks (e.g.,
Multi30K, xGQA, XVNLI, and MaRVL) highlights a new challenge: finding high-quality
training data that is both multilingual and multimodal. In