The domain of joint vision-language understanding, especially reasoning in visual question answering (VQA) models, has garnered significant attention in recent years. While most existing VQA models focus on improving answer accuracy, the way models arrive at those answers has received far less attention.