In the past few years, the emergence of pre-training models has brought
uni-modal fields such as computer vision (CV) and natural language processing
(NLP) to a new era. Substantial works have shown they are beneficial for
downstream uni-modal tasks and avoid training a new model from