With the burgeoning amount of data of image-text pairs and diversity of
vision-and-language (V&L) tasks, scholars have introduced an abundance of deep
learning models in this research domain. Furthermore, in recent years, transfer
learning has also shown tremendous success in Computer