Video moment retrieval is a fundamental vision-language task that aims to
retrieve target moments from an untrimmed video based on a language query.
Existing methods typically generate numerous proposals manually or via
generative networks in advance as the support set for retrieval, which is not
only inflexible but also time-consuming. Inspired by the success of diffusion
models on object detection, this work reformulates video moment retrieval as
a denoising generation process, removing the inflexible and time-consuming
proposal generation step. To this end, we propose a novel
proposal-free framework, namely DiffusionVMR, which directly samples random
spans from noise as candidates and introduces denoising learning to ground
target moments. During training, Gaussian noise is added to the real moments,
and the model is trained to reverse this process. At inference, a
set of time spans is progressively refined from the initial noise to the final
output. Notably, the training and inference of DiffusionVMR are decoupled: the
number of random spans used at inference need not match the number used in
training. Extensive experiments on three widely used benchmarks
(QVHighlights, Charades-STA, and TACoS) demonstrate the effectiveness of
DiffusionVMR in comparison with state-of-the-art methods.
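
The training and inference procedure described above can be illustrated with a minimal sketch. It assumes spans are (center, width) pairs normalized by video duration, a cosine noise schedule, and a deterministic DDIM-style update; none of these choices is stated in the abstract, and `denoiser` is a hypothetical stand-in for the trained network.

```python
import math
import random

def cosine_alpha_bar(t, num_steps):
    """Cumulative signal level at step t (assumed cosine schedule)."""
    s = 0.008
    f = lambda u: math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def add_noise(span, t, num_steps, rng):
    """Training side: corrupt a ground-truth span with Gaussian noise."""
    a = cosine_alpha_bar(t, num_steps)
    return tuple(math.sqrt(a) * x + math.sqrt(1 - a) * rng.gauss(0, 1) for x in span)

def refine(spans, denoiser, num_steps, sample_steps):
    """Inference side: refine any number of random spans step by step.

    Assumes sample_steps <= num_steps so every visited step t is > 0.
    """
    # Evenly spaced timesteps from num_steps down to 0.
    ts = [round(num_steps * (1 - i / sample_steps)) for i in range(sample_steps + 1)]
    for t, t_next in zip(ts[:-1], ts[1:]):
        a = cosine_alpha_bar(t, num_steps)
        a_next = cosine_alpha_bar(t_next, num_steps)
        refined = []
        for span in spans:
            x0 = denoiser(span, t)  # predicted clean span
            new = []
            for x, x0_i in zip(span, x0):
                # Noise implied by the current span and the prediction.
                eps = (x - math.sqrt(a) * x0_i) / math.sqrt(1 - a)
                new.append(math.sqrt(a_next) * x0_i + math.sqrt(1 - a_next) * eps)
            refined.append(tuple(new))
        spans = refined
    return spans

rng = random.Random(0)
target = (0.4, 0.2)  # hypothetical ground-truth (center, width)
noisy = add_noise(target, t=500, num_steps=1000, rng=rng)

# Toy oracle in place of a trained network: always predicts the target.
oracle = lambda span, t: target
init = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(7)]
final = refine(init, oracle, num_steps=1000, sample_steps=4)
```

The number of initial spans (7 here) is chosen freely at inference, which mirrors the decoupling between training and inference noted above. With a real network, the oracle would be replaced by a model conditioned on the video and the query.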

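This work proposes a proposal-free framework named DiffusionVMR, which recasts video moment retrieval as a denoising generation process, directly samples random spans from noise as candidates, and introduces denoising learning to ground the target moments. Experiments show that DiffusionVMR is more effective than existing methods.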

DiffusionVMR: Diffusion Model for Video Moment Retrieval

The video grounding (VG) task aims to locate the queried action or event in
an untrimmed video based on rich linguistic descriptions. Existing
proposal-free methods depend on complex interactions between video and
query and overemphasize cross-modal feature fusion and feature correlation.
In this paper, we propose a novel boundary regression paradigm that
performs regression token learning in a transformer. Specifically, we present a
simple but effective proposal-free framework, namely Video Grounding
Transformer (ViGT), which predicts the temporal boundary using a learnable
regression token rather than multi-modal or cross-modal features. In ViGT, the
learnable token offers two benefits. (1) The token is
unrelated to the video or the query and avoids data bias toward the original
video and query. (2) The token simultaneously performs global context
aggregation from video and query features. First, we employ a shared feature
encoder to project both video and query into a joint feature space before
performing cross-modal co-attention (i.e., video-to-query attention and
query-to-video attention) to highlight discriminative features in each
modality. Next, we concatenate a learnable regression token [REG] with
the video and query features as the input of a vision-language transformer.
Finally, we use the [REG] token to predict the target moment and the visual
features to constrain the foreground and background probabilities at each
timestamp. ViGT performs well on three public datasets: ActivityNet
Captions, TACoS, and YouCookII. Extensive ablation studies and qualitative
analyses further validate the interpretability of ViGT.
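
The regression-token idea can be sketched in a few lines. This is a toy illustration, not the ViGT architecture: it assumes a single self-attention layer with identity projections, random vectors in place of learned features, and a hypothetical sigmoid regression head whose two outputs are sorted into (start, end).

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def self_attention(tokens):
    """Single-head self-attention with identity projections (toy simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def predict_span(reg_token, video_feats, query_feats, head):
    """Concatenate [REG] with both modalities, attend, regress from [REG] only."""
    tokens = [reg_token] + video_feats + query_feats
    reg_out = self_attention(tokens)[0]  # [REG] aggregates global context
    # Linear head plus sigmoid yields two values in (0, 1).
    a, b = (1 / (1 + math.exp(-sum(w * x for w, x in zip(row, reg_out))))
            for row in head)
    return min(a, b), max(a, b)  # assumption: sort into (start, end)

rng = random.Random(0)
dim = 8
rand_vec = lambda: [rng.uniform(-1, 1) for _ in range(dim)]

reg = rand_vec()                         # learnable in practice; random here
video = [rand_vec() for _ in range(16)]  # 16 hypothetical clip features
query = [rand_vec() for _ in range(5)]   # 5 hypothetical word features
head = [rand_vec(), rand_vec()]          # 2 x dim regression weights

start, end = predict_span(reg, video, query, head)
```

The initial value of the [REG] vector does not depend on the video or the query; it only picks up their content through attention, which is the property the abstract highlights.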

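Working from video and language descriptions, this work proposes a new boundary regression paradigm for locating actions or events in a video. It predicts the temporal boundary with a learnable regression token rather than cross-modal features, achieves good results, and is further validated by additional analysis.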