In recent years, the explosion of web videos has made text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video hi
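This paper proposes X-CLIP, a multi-level contrastive model that uses an Attention Over Similarity Matrix (AOSM) module to aggregate multi-grained similarity matrices to the instance level, substantially improving video-text retrieval performance. On five widely used video-text retrieval datasets, X-CLIP improves on the previous state-of-the-art models by 6.3% to 11.1%, demonstrating the advantage of the multi-level contrastive model and the AOSM module.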
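
To make the aggregation idea concrete, here is a minimal NumPy sketch of attention-style pooling over a frame-word similarity matrix. It is an illustration under stated assumptions, not the paper's implementation: the softmax-weighted sum, the temperature `tau`, the averaging of the two pooling directions, and the function names `attention_pool` and `aggregate_frame_word` are all choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention_pool(sim, axis, tau=0.01):
    # Assumption: scores are pooled with softmax(sim / tau) as weights,
    # so higher similarities contribute more than under plain averaging.
    weights = softmax(sim / tau, axis=axis)
    return np.sum(weights * sim, axis=axis)


def aggregate_frame_word(sim, tau=0.01):
    # sim has shape (n_frames, n_words): one score per frame-word pair.
    per_word = attention_pool(sim, axis=0, tau=tau)    # pool over frames
    per_frame = attention_pool(sim, axis=1, tau=tau)   # pool over words
    # Pool each vector again to get one instance-level score per direction,
    # then average the two directions (an assumption of this sketch).
    score_from_words = attention_pool(per_word, axis=0, tau=tau)
    score_from_frames = attention_pool(per_frame, axis=0, tau=tau)
    return 0.5 * (score_from_words + score_from_frames)


if __name__ == "__main__":
    # Hypothetical similarity scores for 2 frames and 2 words.
    sim = np.array([[0.1, 0.9],
                    [0.2, 0.3]])
    print(aggregate_frame_word(sim))
```

With a small `tau` the pooled score moves toward the largest entries of the matrix, and with a very large `tau` it approaches the plain mean, so the temperature controls how strongly the aggregation favors the best-matching frame-word pairs.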