Large-scale pre-training has brought unimodal fields such as computer vision
and natural language processing to a new era. Following this trend, the size of
multi-modal learning models constantly increases, leading to an urgent need to
reduce the massive computational cost of finetunin