Existing methods for video question answering (VideoQA) often suffer from
spurious correlations between different modalities, leading to a failure in
identifying the dominant visual evidence and the intended question. Moreover,
these methods function as black boxes, making it difficult