Recent research has made impressive progress in large-scale multimodal
pre-training. Given the rapid growth of model size, it has become necessary
to seek efficient and flexible alternatives to full finetuning. In this paper,
we propose to use learnable prompt vectors to align the modalities.
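As a rough illustration of the general idea (not this paper's exact implementation), prompt tuning keeps the pretrained model frozen and learns only a small set of prompt vectors that are prepended to the input embedding sequence; all shapes and names below are illustrative assumptions:

```python
import numpy as np

# Hypothetical minimal sketch: only the prompt vectors are trainable;
# the pretrained encoder and the token embeddings stay frozen.
rng = np.random.default_rng(0)

seq_len, d_model, n_prompts = 8, 16, 4

# Frozen input embeddings for one sequence (e.g., text or image tokens).
x = rng.standard_normal((seq_len, d_model))

# Learnable prompt vectors -- the only parameters updated during tuning.
prompts = rng.standard_normal((n_prompts, d_model))

# Prepend the prompts before feeding the sequence to the frozen encoder.
x_prompted = np.concatenate([prompts, x], axis=0)

print(x_prompted.shape)  # (12, 16)
```

Because gradients flow only into `prompts`, the number of tuned parameters is `n_prompts * d_model`, orders of magnitude smaller than the full model.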