Modern visual question answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training such as overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the imag