In recent years, there is a growing number of pre-trained models trained on a
large corpus of data and yielding good performance on various tasks such as
classifying multimodal datasets. These models have shown good performance on
natural images but are not fully explored for scarce ab