While several benefits were realized for multilingual vision-language
pretrained models, recent benchmarks across various tasks and languages showed
poor cross-lingual generalisation when multilingually pre-trained
vision-language models are applied to non-English data, with a large ga