Conventional Transformer-based Video Question Answering (VideoQA) approaches
generally encode frames independently through one or more image encoders
followed by interaction between frames and question. However, such a scheme
incurs significant memory use and inevitably slows down training and
inference. In this work, we present a highly efficient approach for
VideoQA based on existing vision-language pre-trained models where we
concatenate video frames into an $n\times n$ matrix and then convert it to one
image. By doing so, we reduce the number of image encoder calls from $n^{2}$ to $1$
while maintaining the temporal structure of the original video. Experimental
results on MSRVTT and TrafficQA show that our proposed approach achieves
state-of-the-art performance while running nearly $4\times$ faster and using only 30%
of the memory. We show that integrating our approach into VideoQA systems
achieves comparable, even superior, performance with a significant speedup
in training and inference. We believe the proposed approach can facilitate
VideoQA-related research by reducing the computational requirements for those
with limited budgets and resources. Our code will be made
publicly available for research use.

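This paper proposes an efficient video question answering method built on
existing vision-language pre-trained models. The method concatenates video
frames into an $n\times n$ matrix, reducing the use of the image encoder from
$n^2$ to 1 while preserving the temporal structure of the original video.
Experimental results show that our method achieves performance on par with or
better than the current best methods on the MSRVTT and TrafficQA datasets,
while running nearly 4 times faster and using only 30% of the memory of
existing methods, thereby saving computational resources.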
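
The frame concatenation described above can be illustrated with a minimal
sketch. This is not the authors' released code: the function name
`frames_to_grid` is hypothetical, and the sketch assumes that the $n^{2}$
sampled frames all share one size and are placed in row-major order (time runs
left to right, then top to bottom), which the abstract does not specify.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile n*n frames of shape (H, W, C) into one (n*H, n*W, C) image.

    Assumption: frames are ordered in time and laid out row by row.
    """
    if frames.ndim != 4 or frames.shape[0] != n * n:
        raise ValueError(f"expected {n * n} frames of shape (H, W, C)")
    _, h, w, c = frames.shape
    grid = frames.reshape(n, n, h, w, c)   # (grid_row, grid_col, H, W, C)
    grid = grid.transpose(0, 2, 1, 3, 4)   # (grid_row, H, grid_col, W, C)
    return grid.reshape(n * h, n * w, c)   # one image holding all frames


# Toy example: four 2x3 single-channel "frames", each filled with its index.
frames = np.stack([np.full((2, 3, 1), k, dtype=np.uint8) for k in range(4)])
image = frames_to_grid(frames, n=2)
print(image.shape)  # (4, 6, 1)
```

The resulting array can then be resized to the input resolution of the image
encoder, so that a single forward pass covers all $n^{2}$ frames.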