Video moment retrieval is a challenging task that requires fine-grained interaction between the video and text modalities. Recent work in image-text pretraining has shown that most existing pretrained models suffer from information asymmetry caused by the difference in length between visual and textual sequences. We ask whether the same problem exists in the video-text domain, where there is the additional need to preserve both spatial and temporal information. To this end, we evaluate a recently proposed solution, the addition of an asymmetric co-attention network, on video grounding tasks. We also incorporate a momentum contrastive loss for robust, discriminative representation learning in both modalities. We find that integrating these modules outperforms state-of-the-art models on the TACoS dataset and gives comparable results on ActivityNet Captions, while using significantly fewer parameters than the baseline.
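The abstract does not spell out the form of the momentum contrastive loss. As a rough illustration only, the sketch below assumes a MoCo-style setup: an InfoNCE loss over one positive and several negative keys, with the key encoder updated as an exponential moving average of the online encoder. The function names, the temperature of 0.07, and the momentum of 0.995 are assumptions made for this example, not values taken from the paper.

```python
import math


def ema_update(momentum_params, online_params, m=0.995):
    """Momentum (EMA) update: the key encoder slowly tracks the online encoder.

    Both arguments are flat lists of floats standing in for encoder weights
    (a simplification for illustration).
    """
    return [m * k + (1.0 - m) * q for k, q in zip(momentum_params, online_params)]


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query, e.g. a text embedding against video keys.

    The positive key sits at index 0 of the logits, so the loss is the
    cross-entropy of the softmax over similarities with target class 0.
    """
    q = _normalize(query)
    keys = [_normalize(positive_key)] + [_normalize(k) for k in negative_keys]
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denom - logits[0]


if __name__ == "__main__":
    # Hypothetical 2-d embeddings: the query matches the positive key.
    aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
    misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
    print(aligned < misaligned)
```

In a two-modality setting, a loss of this kind would typically be applied in both directions (text queries against video keys and video queries against text keys) and the two terms averaged; whether the paper does exactly this is not stated in the abstract.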

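Video moment retrieval is a challenging task that requires fine-grained interaction between the video and text modalities. We evaluate a recently proposed solution that introduces an asymmetric co-attention network for video grounding tasks and adds a momentum contrastive loss in both modalities. The combined model performs better on the TACoS dataset and achieves comparable results on ActivityNet Captions, with significantly fewer parameters than the baseline.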

Cross-Modal Contrastive Learning and Asymmetric Co-Attention Networks for Video Moment Retrieval