Cross-modal retrieval between videos and texts has gained increasing research
interest due to the rapid emergence of videos on the web. Generally, a video
contains rich instance and event information and the query text only describes
a part of the information. Thus, a video can correspond to multiple different
text descriptions and queries. We call this phenomenon the ``Video-Text
Correspondence Ambiguity'' problem. Current techniques mostly concentrate on
mining local or multi-level alignment between contents of a video and text
(\textit{e.g.}, object to entity and action to verb). Because these methods
describe a video with a single feature that must match multiple different text
features at the same time, they struggle to alleviate the video-text
correspondence ambiguity. To address this problem, we
propose a Text-Adaptive Multiple Visual Prototype Matching model, which
automatically captures multiple prototypes to describe a video by adaptive
aggregation of video token features. Given a query text, the video-text
similarity is determined by the prototype most similar to the query, which
locates the corresponding content in the video; we term this text-adaptive
matching. To learn diverse prototypes for
representing the rich information in videos, we propose a variance loss to
encourage different prototypes to attend to different contents of the video.
Our method outperforms state-of-the-art methods on four public video retrieval
datasets.
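
The matching step described in this abstract can be illustrated with a minimal sketch. It assumes each prototype is a softmax-weighted sum of video token features and that video-text similarity is the maximum cosine similarity over prototypes. The function names and the `attn_logits` input (standing in for learned aggregation scores) are hypothetical; the abstract does not specify the actual architecture, and the variance loss is not covered here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_prototypes(tokens, attn_logits):
    # tokens: N token feature vectors of equal dimension.
    # attn_logits: K lists of N scores, one list per prototype
    # (hypothetical stand-in for learned aggregation weights).
    dim = len(tokens[0])
    prototypes = []
    for logits in attn_logits:
        weights = softmax(logits)
        proto = [sum(w * tok[d] for w, tok in zip(weights, tokens))
                 for d in range(dim)]
        prototypes.append(proto)
    return prototypes

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # assumption: treat zero vectors as dissimilar
    return dot / (na * nb)

def text_adaptive_similarity(text_feat, prototypes):
    # The similarity is set by the single most similar prototype.
    sims = [cosine(text_feat, p) for p in prototypes]
    best = max(range(len(sims)), key=sims.__getitem__)
    return sims[best], best

# Toy video with two tokens carrying different content.
tokens = [[1.0, 0.0], [0.0, 1.0]]
protos = aggregate_prototypes(tokens, [[10.0, -10.0], [-10.0, 10.0]])
sim_a, idx_a = text_adaptive_similarity([1.0, 0.1], protos)
sim_b, idx_b = text_adaptive_similarity([0.1, 1.0], protos)
```

In this toy case two different queries both match the same video strongly, each through a different prototype, which is the behaviour a single video feature cannot provide.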

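This paper proposes a Text-Adaptive Multiple Visual Prototype Matching model, which describes a video by adaptively aggregating video token features in order to address the correspondence ambiguity between videos and texts; the method outperforms state-of-the-art methods on public video retrieval datasets.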

Text-adaptive multiple visual prototype matching for video retrieval

Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

We aim to understand how actions are performed and identify subtle
differences, such as 'fold firmly' vs. 'fold gently'. To this end, we propose a
method which recognizes adverbs across different actions. However, such
fine-grained annotations are difficult to obtain and their long-tailed nature
makes it challenging to recognize adverbs in rare action-adverb compositions.
Our approach therefore uses semi-supervised learning with multiple adverb
pseudo-labels to leverage videos with only action labels. Combined with
adaptive thresholding of these pseudo-adverbs, we are able to make efficient use
of the available data while tackling the long-tailed distribution.
Additionally, we gather adverb annotations for three existing video retrieval
datasets, which allows us to introduce the new tasks of recognizing adverbs in
unseen action-adverb compositions and unseen domains. Experiments demonstrate
the effectiveness of our method, which outperforms prior work in recognizing
adverbs and semi-supervised works adapted for adverb recognition. We also show
how adverbs can relate fine-grained actions.
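
To make the idea of multiple pseudo-labels with adaptive thresholding concrete, here is a minimal sketch under stated assumptions. The abstract does not give the thresholding rule, so the rule below is hypothetical: each adverb's threshold is scaled by its running mean confidence relative to the most confident adverb, so that rarely predicted adverbs face a lower bar. The class name, `base`, `momentum`, and the initial running value are all illustrative choices.

```python
class AdaptiveThreshold:
    def __init__(self, adverbs, base=0.8, momentum=0.9, init=0.5):
        self.base = base          # threshold for the most confident adverb
        self.momentum = momentum  # smoothing factor for the running mean
        self.running = {a: init for a in adverbs}

    def update(self, scores):
        # scores: dict mapping adverb -> model confidence in [0, 1].
        for adverb, s in scores.items():
            old = self.running[adverb]
            self.running[adverb] = self.momentum * old + (1 - self.momentum) * s

    def threshold(self, adverb):
        top = max(self.running.values())
        if top == 0.0:
            return self.base  # assumption: fall back to the fixed threshold
        return self.base * self.running[adverb] / top

    def select(self, scores):
        # Keep every adverb that passes its own threshold, so a video
        # with only an action label can receive several pseudo-adverbs.
        return [a for a, s in scores.items() if s >= self.threshold(a)]

thr = AdaptiveThreshold(["firmly", "gently"], base=0.8, momentum=0.5)
thr.update({"firmly": 0.9, "gently": 0.3})
both = thr.select({"firmly": 0.85, "gently": 0.5})
only_rare = thr.select({"firmly": 0.75, "gently": 0.5})
```

With a single fixed threshold of 0.8, "gently" would be discarded in both calls; the per-adverb threshold is what lets the less confidently predicted adverb contribute pseudo-labels.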

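This study proposes a semi-supervised learning method for recognizing the adverbs that modify actions, in order to understand subtle differences in how actions are performed, and it shows strong empirical results.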