Vision-Language (VL) pre-training, spanning both image and video, has recently
become a popular paradigm, achieving state-of-the-art results on multi-modal
tasks such as image retrieval, video retrieval, and visual question answering. These models
are trained in an unsupervised manner and benefit greatly from the supervision
provided by the complementary modality. In this paper, we explore whether the