Video Temporal Grounding (VTG) aims to identify the visual frames in a video clip that match a text query. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these approaches overlook a crucial aspect of the problem: a holistic understanding of the query sentence. A model may capture correlations between individual word tokens and arbitrary visual frames while missing the global meaning of the sentence. To address this, we introduce two primary contributions: (1) a visual frame-level gate mechanism that incorporates holistic textual information, and (2) a cross-modal alignment loss that learns the fine-grained correlation between the query and its relevant frames. As a result, we regularize the effect of individual word tokens and suppress irrelevant visual frames. We demonstrate that our method outperforms state-of-the-art approaches on VTG benchmarks, indicating that holistic text understanding guides the model to focus on the semantically important parts of the video.
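
To make the two contributions concrete, the following is a minimal illustrative sketch, not the method as implemented in this work. It assumes the holistic query representation is a mean over word-token embeddings, that the gate is a sigmoid of a scaled dot product between each frame and that sentence vector, and that the alignment loss can be stood in for by a binary cross-entropy against frame relevance labels. All function names (`mean_pool`, `frame_gates`, `apply_gates`, `alignment_loss`) are hypothetical.

```python
import math


def mean_pool(token_embeddings):
    """Average word-token vectors into one sentence-level vector (assumed pooling)."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frame_gates(frame_features, sentence_vector):
    """Return one gate value in (0, 1) per frame.

    Assumption: the gate is the sigmoid of the scaled dot product between
    the frame feature and the sentence-level vector.
    """
    scale = math.sqrt(len(sentence_vector))
    return [
        sigmoid(sum(f * s for f, s in zip(frame, sentence_vector)) / scale)
        for frame in frame_features
    ]


def apply_gates(frame_features, gates):
    """Scale every frame feature by its gate, suppressing low-gate frames."""
    return [[g * f for f in frame] for frame, g in zip(frame_features, gates)]


def alignment_loss(gates, relevant):
    """Binary cross-entropy between gates and 0/1 frame relevance labels.

    Assumption: this is a stand-in for a cross-modal alignment loss; it
    rewards high gates on relevant frames and low gates elsewhere.
    """
    eps = 1e-12
    total = 0.0
    for g, y in zip(gates, relevant):
        total -= y * math.log(g + eps) + (1 - y) * math.log(1.0 - g + eps)
    return total / len(gates)


# Toy usage: two word tokens, three frames, only the first frame is relevant.
tokens = [[1.0, 0.0], [0.0, 1.0]]
frames = [[4.0, 4.0], [-4.0, -4.0], [0.0, 0.0]]
sentence = mean_pool(tokens)
gates = frame_gates(frames, sentence)
gated_frames = apply_gates(frames, gates)
loss = alignment_loss(gates, [1, 0, 0])
```

In this toy setup the frame aligned with the sentence vector receives a gate near 1 and the opposed frame a gate near 0, so the gated features downweight the irrelevant frame before any token-level cross-attention would see it.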

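This work addresses the problem of matching text queries to video frames in Video Temporal Grounding (VTG) and proposes a new method that integrates holistic text understanding. By introducing a visual frame-level gate mechanism and a cross-modal alignment loss, we improve the fine-grained correlation between video frames and text queries and significantly improve model performance on VTG benchmarks, underscoring the key role of holistic text understanding in localizing the semantically important parts of a video.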

Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding