Videos are a highly redundant data source, and it is often enough to identify a
few key moments to solve any given task. In this paper, we present a
text-conditioned video resampler (TCR) module that uses a pre-trained and
frozen visual encoder and large language model (LLM) to process long video
sequences for a task. TCR localises relevant visual features from the video
given a text condition and provides them to an LLM to generate a text response.
Due to its lightweight design and use of cross-attention, TCR can process more
than 100 frames at a time, allowing the model to use much longer chunks of video
than earlier works. We make the following contributions: (i) we design a
transformer-based sampling architecture that can process long videos
conditioned on a task, together with a training method that enables it to
bridge pre-trained visual and language models; (ii) we empirically validate its
efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art
on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we identify tasks
that require longer video contexts and can thus be used effectively for
further evaluation of long-range video models.
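
To make the resampling idea above concrete, the following is a minimal NumPy
sketch of text-conditioned cross-attention, not the TCR implementation itself.
The layering (queries mix with text through self-attention, then cross-attend
to frozen visual features), the function names `attend` and `resample`, the
random weights, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(q_in, kv_in, w_q, w_k, w_v):
    # Single-head scaled dot-product attention from q_in over kv_in.
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


def resample(queries, text_tokens, visual_feats, params):
    """Hypothetical resampler: returns a fixed number of visual tokens.

    queries:      (n_q, d) learnable query vectors
    text_tokens:  (n_t, d) embedded text condition
    visual_feats: (n_v, d) features from a frozen visual encoder
    """
    n_q = queries.shape[0]
    # Step 1 (assumed): queries and text interact through self-attention,
    # so the queries become dependent on the text condition.
    joint = np.concatenate([queries, text_tokens], axis=0)
    mixed, _ = attend(joint, joint, *params["self"])
    cond_queries = queries + mixed[:n_q]
    # Step 2: conditioned queries cross-attend to all visual features.
    out, weights = attend(cond_queries, visual_feats, *params["cross"])
    return cond_queries + out, weights


rng = np.random.default_rng(0)
d, n_q, n_t = 32, 8, 5            # illustrative sizes only
n_frames, patches = 120, 4        # more than 100 frames, 4 features per frame
params = {
    name: [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    for name in ("self", "cross")
}
queries = rng.normal(size=(n_q, d))
text_tokens = rng.normal(size=(n_t, d))
visual_feats = rng.normal(size=(n_frames * patches, d))

resampled, weights = resample(queries, text_tokens, visual_feats, params)
print(resampled.shape, weights.shape)  # (8, 32) (8, 480)
```

In this sketch the number of tokens passed on to the language model is fixed by
the number of queries, so the output size does not grow with the number of
frames, and the cost of the cross-attention step grows only linearly with the
number of visual features.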

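Using a text-conditioned video resampler (TCR) module together with a
pre-trained visual encoder and large language model (LLM), we design a
Transformer-based sampling architecture that can process long video sequences.
Through a cross-attention mechanism, it extracts the relevant visual features
from the video and generates a text response with the LLM. Our method performs
well across a variety of evaluation tasks and sets a new state of the art on
NextQA, EgoSchema, and the EGO4D-LTA challenge. We also identify tasks that
require longer video contexts and can be used effectively for further
evaluation of long-range video models.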