Video-text Large Language Models (video-text LLMs) have shown remarkable
performance in answering questions and holding conversations on simple videos.
However, they perform close to random chance when grounding text queries in
long and complicated videos, showing little ability to understand and reason
about temporal information, which is the most fundamental difference between
videos and images. In this paper, we propose HawkEye, one of the first
video-text LLMs that can perform temporal video grounding in a fully
text-to-text manner. To obtain training data suitable for temporal
video grounding, we construct InternVid-G, a large-scale video-text corpus with
segment-level captions and negative spans, and use it to introduce two new
time-aware training objectives for video-text LLMs. We also propose a
coarse-grained method of representing segments in videos, which is more robust
and easier for LLMs to learn and follow than the alternatives. Extensive
experiments show that HawkEye is better at temporal video grounding than
existing video-text LLMs and comparable to them on other video-text tasks,
which verifies its superior video-text multi-modal understanding abilities.

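We propose HawkEye, one of the first video-text LLMs to perform temporal video grounding in a fully text-to-text manner, and construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans. We introduce two new time-aware training objectives and a coarse-grained method of representing video segments. As a result, HawkEye performs better at temporal video grounding while remaining comparable to existing video-text LLMs on other video-text tasks, which verifies its superior video-text multi-modal understanding abilities.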