Video moment retrieval is a fundamental visual-language task that aims to retrieve target moments from an untrimmed video based on a language query. Existing methods typically generate numerous proposals manually or via generative networks in advance as the support set for retrieval, which is not only inflexible but also time-consuming. Inspired by the success of diffusion models on object detection, this work aims at reformulating video moment retrieval as a denoising generation process to get rid of the inflexible and time-consuming proposal generation. To this end, we propose a novel proposal-free framework, namely DiffusionVMR, which directly samples random spans from noise as candidates and introduces denoising learning to ground target moments. During training, Gaussian noise is added to the real moments, and the model is trained to learn how to reverse this process. In inference, a set of time spans is progressively refined from the initial noise to the final output. Notably, the training and inference of DiffusionVMR are decoupled, and an arbitrary number of random spans can be used in inference without being consistent with the training phase. Extensive experiments conducted on three widely-used benchmarks (i.e., QVHighlight, Charades-STA, and TACoS) demonstrate the effectiveness of the proposed DiffusionVMR by comparing it with state-of-the-art methods.

该研究提出了一种名为DiffusionVMR的提议无关框架，通过将视频时刻检索重新构想为去噪生成过程，直接从噪声中采样随机时段作为候选，并引入去噪学习以确定目标时刻。实验证明DiffusionVMR相比现有方法具有更高的效果。

DiffusionVMR：视频时刻检索的扩散模型