Video question--answering is a fundamental task in the field of video
understanding. Although current vision--language models (VLMs) equipped with
video transformers have enabled temporal modeling and yielded superior results,
they come at the cost of huge computational power and thus t