In this paper, we propose VidLA, an approach for video-language alignment at scale. Previous video-language alignment approaches have two major limitations. First, they do not capture both short-range and long-range temporal dependencies, and they typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment methods are held back by the lack of semantically aligned large-scale training data. To overcome this, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets, which contain only short clips, our dataset is enriched with video clips of varying durations to help our temporally hierarchical data tokens extract better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.
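
The idea of summarizing a video at several temporal resolutions can be illustrated with a toy sketch. Note the assumptions: mean pooling over fixed, non-overlapping windows stands in for the data tokens purely for illustration and is not claimed to be the mechanism VidLA uses; the function names `mean_pool` and `multiscale_summaries`, the window sizes, and the list-of-floats feature format are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(window: Sequence[Vector]) -> Vector:
    """Average a non-empty window of equal-length feature vectors."""
    dim = len(window[0])
    return [sum(vec[i] for vec in window) / len(window) for i in range(dim)]


def multiscale_summaries(
    frames: Sequence[Vector], window_sizes: Sequence[int]
) -> Dict[int, List[Vector]]:
    """Summarize per-frame features at several temporal resolutions.

    Illustrative assumption: each resolution is a set of mean-pooled,
    non-overlapping windows. Small windows keep short-range detail,
    large windows give a coarse long-range view. The final window at
    a given size may be shorter if the sizes do not divide evenly.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    summaries: Dict[int, List[Vector]] = {}
    for size in window_sizes:
        if size <= 0:
            raise ValueError("window sizes must be positive")
        summaries[size] = [
            mean_pool(frames[start:start + size])
            for start in range(0, len(frames), size)
        ]
    return summaries


if __name__ == "__main__":
    # Four frames with one-dimensional features, viewed at three scales.
    toy_frames = [[0.0], [2.0], [4.0], [6.0]]
    for size, tokens in multiscale_summaries(toy_frames, [1, 2, 4]).items():
        print(size, tokens)
```

In this sketch, a window size of 1 keeps every frame, while the largest window collapses the whole clip into a single summary, which mirrors the short-range versus long-range distinction drawn above.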

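We propose VidLA, an approach for video-language alignment at scale. It uses a set of data tokens operating at different temporal resolutions to capture short-range and long-range temporal dependencies hierarchically, and it adopts a simple two-tower architecture that uses pretrained image-text foundation models to improve final performance. In addition, we use recent LLMs to build the largest video-language dataset to date, containing video clips of varying lengths, to help extract better representations at different temporal scales. Experiments show that our approach surpasses existing state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and is competitive on classification benchmarks.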

VidLA: Video-Language Alignment at Scale