Current pre-trained vison-language models (PVLMs) achieve excellent
performance on a range of multi-modal datasets. Recent work has aimed at
building multilingual models, and a range of novel multilingual multi-modal
datasets have been proposed. Current PVLMs typically perform poorly o