As transformer evolves, pre-trained models have advanced at a breakneck pace
in recent years. They have dominated the mainstream techniques in natural
language processing (NLP) and computer vision (CV). How to adapt pre-training
to the field of Vision-and-Language (V-L) learning and im