With recent progress in large-scale vision and language representation
learning, vision-language pre-training (VLP) models have achieved promising
improvements on various multi-modal downstream tasks. Although powerful, these
pre-trained models still do not take advantage of world knowledge.