In this paper, we explore the visual representations produced by a pretrained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned by a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. We validate this hypothesis on the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed ``VD-IT'', with purpose-designed components built upon a fixed pretrained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. In addition, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and improves segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pretraining tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at \url{https://github.com/buxiangzhiren/VD-IT}.
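To make the contrast between standard Gaussian noise and video-specific noise concrete, the following is a minimal sketch and not the VD-IT implementation. It assumes a DDPM-style forward step, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, applied to a toy latent stored as a list of frames. The names `forward_noise`, `gaussian_noise`, and `predict_noise` are hypothetical, the noise level `alpha_bar_t` is an arbitrary choice, and `predict_noise` is only a placeholder (per-frame standardization) for a learned noise prediction module whose architecture the abstract does not describe.

```python
import math
import random


def forward_noise(latent, noise, alpha_bar_t):
    """DDPM-style forward step: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [[a * x + b * e for x, e in zip(fx, fe)]
            for fx, fe in zip(latent, noise)]


def gaussian_noise(latent, rng):
    """Standard choice: noise drawn independently of the video content."""
    return [[rng.gauss(0.0, 1.0) for _ in frame] for frame in latent]


def predict_noise(latent):
    """Hypothetical placeholder for a learned noise prediction module.

    Here the 'prediction' is just each frame standardized to zero mean and
    unit variance, so the noise depends on the video itself.
    """
    out = []
    for frame in latent:
        mean = sum(frame) / len(frame)
        var = sum((x - mean) ** 2 for x in frame) / len(frame)
        std = math.sqrt(var) + 1e-8
        out.append([(x - mean) / std for x in frame])
    return out


rng = random.Random(0)
# Toy video latent: 4 frames, each flattened to 16 values (assumed sizes).
latent = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
alpha_bar_t = 0.9  # assumed noise level, not taken from the paper

noisy_gaussian = forward_noise(latent, gaussian_noise(latent, rng), alpha_bar_t)
noisy_predicted = forward_noise(latent, predict_noise(latent), alpha_bar_t)
```

In this sketch both variants feed the same forward step; only the source of the noise term differs, which is the design choice the abstract highlights.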

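This work explores the visual representations produced by a pretrained text-to-video (T2V) diffusion model for video understanding tasks, validating its hypothesis on the classic referring video object segmentation (R-VOS) task. It introduces a new framework, named ``VD-IT'', built on a pretrained T2V model. The framework uses textual information as a conditional input to ensure semantic consistency over time, and further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Extensive experiments show that, compared with commonly used video backbones pretrained on image/video tasks (e.g., Video Swin Transformer), fixed generative T2V diffusion models have better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, VD-IT achieves highly competitive results.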

Exploring Pretrained Text-to-Video Diffusion Models for Video Object Segmentation