Massive web-crawled image-text datasets lay the foundation for recent
progress in multimodal learning. These datasets are designed to train
models that perform well on standard computer vision benchmarks, many of
which, however, have been shown to be English-centric (e.g., I