Text-video retrieval is a challenging cross-modal task, which aims to align visual entities with natural language descriptions. Current methods either fail to leverage local details or are computationally expensive. Moreover, they fail to leverage the heterogeneous concepts in the data. In this paper, we propose Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings. For disentangled conceptualization, we divide the coarse feature into multiple latent factors related to semantic concepts. For set-to-set alignment, where a set of visual concepts corresponds to a set of textual concepts, we propose an adaptive pooling method that aggregates semantic concepts to address partial matching. In particular, since we encode concepts independently in only a few dimensions, DiCoSA is superior in both efficiency and granularity, ensuring fine-grained interactions with computational complexity similar to that of coarse-grained alignment. Extensive experiments on five datasets, including MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that our method outperforms the existing state-of-the-art methods.
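
The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the number of concepts, the temperature, and in particular the softmax-weighted pooling are assumptions chosen for illustration, standing in for the adaptive pooling that the abstract mentions without specifying. The sketch splits one coarse feature vector per modality into equal-sized concept chunks, compares matching chunks, and pools the per-concept similarities so that well-matched concepts dominate the score.

```python
import numpy as np


def split_into_concepts(feature, num_concepts):
    """Split a (D,) coarse feature into (K, D/K) chunks and L2-normalize each chunk.

    Treating each chunk as one latent concept is an illustrative assumption.
    """
    feature = np.asarray(feature, dtype=np.float64)
    if feature.ndim != 1 or feature.shape[0] % num_concepts != 0:
        raise ValueError("feature must be 1-D with length divisible by num_concepts")
    chunks = feature.reshape(num_concepts, -1)
    norms = np.linalg.norm(chunks, axis=1, keepdims=True)
    return chunks / np.maximum(norms, 1e-12)


def set_to_set_similarity(text_feature, video_feature, num_concepts=8, temperature=0.1):
    """Score a text-video pair from per-concept similarities.

    Returns (score, per-concept similarities, pooling weights).
    The softmax pooling below is a hypothetical stand-in for adaptive pooling.
    """
    text_concepts = split_into_concepts(text_feature, num_concepts)
    video_concepts = split_into_concepts(video_feature, num_concepts)

    # Cosine similarity between the k-th text concept and the k-th video concept.
    concept_sims = np.sum(text_concepts * video_concepts, axis=1)

    # Softmax over concepts: strongly matched concepts receive larger weights,
    # so a pair that agrees on only some concepts can still score highly.
    logits = concept_sims / temperature
    logits = logits - logits.max()  # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    score = float(np.sum(weights * concept_sims))
    return score, concept_sims, weights
```

Because every concept occupies only D/K dimensions, the K per-concept dot products together touch each of the D dimensions once, which is the same order of work as a single dot product between the two coarse features. With plain mean pooling, a pair that matches on half of its concepts and conflicts on the other half would average toward zero; the weighted pooling in this sketch keeps the score close to that of the matched concepts, which is one way to read "partial matching".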

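This work proposes Disentangled Conceptualization and Set-to-set Alignment (DiCoSA), a method for a cross-modal task that aligns visual entities with natural language descriptions. It conceptualizes by dividing coarse features into multiple latent factors related to semantic concepts, aggregates semantic concepts with an adaptive pooling method to address partial matching, and encodes concepts independently in a small number of dimensions, which makes the interactions both efficient and fine-grained. Experiments on multiple datasets show that the method outperforms existing state-of-the-art methods.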

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment