Contrastive image-text models such as CLIP form the building blocks of many
state-of-the-art systems. While they excel at recognizing common generic
concepts, they still struggle with fine-grained entities that are rare in, or
even absent from, the pre-training dataset. Hence, a key ingredient to their success
has been the use of large-scale curated pre-training data aiming at expanding
the set of concepts that they can memorize during the pre-training stage. In
this work, we explore an alternative to encoding fine-grained knowledge
directly into the model's parameters: we instead train the model to retrieve
this knowledge from an external memory. Specifically, we propose to equip
existing vision-text models with the ability to refine their embedding with
cross-modal retrieved information from a memory at inference time, which
greatly improves their zero-shot predictions. Remarkably, we show that this can
be done with a light-weight, single-layer, fusion transformer on top of a
frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive
(RECO) training improves CLIP performance substantially on several challenging
fine-grained tasks: for example, +10.9 on Stanford Cars, +10.2 on CUB-2011, and
+7.3 on the recent OVEN benchmark.
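
The retrieve-then-refine idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the memory layout (keys in the query's modality, values in the other modality), the choice of k, the random projection matrices standing in for learned weights, and the function names `retrieve_cross_modal` and `fuse` are all assumptions made for illustration. A single attention step with a residual connection stands in for the single-layer fusion transformer, and a random vector stands in for the frozen CLIP embedding.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_cross_modal(query, memory_keys, memory_values, k=4):
    """Return the k memory values whose keys are closest to the query.

    Assumption: keys live in the query's modality (e.g. images) and values
    in the other modality (e.g. text), which makes the lookup cross-modal.
    """
    sims = memory_keys @ query
    top_k = np.argsort(-sims)[:k]
    return memory_values[top_k]


def fuse(query, retrieved, w_q, w_k, w_v):
    """Single-head attention from the query over [query; retrieved items].

    Stand-in for a learned single-layer fusion transformer; the residual
    connection keeps the frozen embedding as the starting point.
    """
    tokens = np.vstack([query[None, :], retrieved])      # (k + 1, d)
    q = query @ w_q                                       # (d,)
    scores = (tokens @ w_k) @ q / np.sqrt(q.shape[0])     # (k + 1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    refined = query + weights @ (tokens @ w_v)            # (d,)
    return l2_normalize(refined)


rng = np.random.default_rng(0)
d, n_memory = 16, 100
memory_keys = l2_normalize(rng.normal(size=(n_memory, d)))    # hypothetical image side
memory_values = l2_normalize(rng.normal(size=(n_memory, d)))  # hypothetical text side
# Random projections; in a trained system these would be learned.
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

image_embedding = l2_normalize(rng.normal(size=d))  # stands in for a frozen encoder output
neighbours = retrieve_cross_modal(image_embedding, memory_keys, memory_values, k=4)
refined_embedding = fuse(image_embedding, neighbours, w_q, w_k, w_v)
print(refined_embedding.shape)
```

In this sketch the encoder output is never modified; only the small fusion step would carry trainable parameters, which mirrors the frozen-backbone setting the abstract describes.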

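This paper proposes RECO, which retrieves fine-grained knowledge from an external memory and applies it to existing vision-text models, achieving substantial performance gains on several tasks, including Stanford Cars, CUB-2011, and the OVEN benchmark.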

Retrieval-Enhanced Contrastive Vision-Text Models

Multi-modal learning from video data has seen increased attention recently, as
it allows training semantically meaningful embeddings without human annotation,
enabling tasks like zero-shot retrieval and classification. In this work, we
present a multi-modal, modality-agnostic fusion transformer approach that
learns to exchange information between multiple modalities, such as video,
audio, and text, and integrate them into a joint multi-modal representation to
obtain an embedding that aggregates multi-modal temporal information. We
propose to train the system with a combinatorial loss on everything at once,
single modalities as well as pairs of modalities, explicitly leaving out any
add-ons such as position or modality encoding. At test time, the resulting
model can process and fuse any number of input modalities. Moreover, the
implicit properties of the transformer allow processing inputs of different
lengths. To evaluate the proposed approach, we train the model on the
large-scale HowTo100M dataset and evaluate the resulting embedding space on four
challenging benchmark datasets, obtaining state-of-the-art results in zero-shot
video retrieval and zero-shot video action localization.
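
A combinatorial loss over single modalities and pairs of modalities can be sketched as follows. This is an illustration under stated assumptions, not the paper's training code: the exact set of loss terms (here, every pairing of non-overlapping modality combinations), the symmetric InfoNCE objective, the temperature value, and the averaging function `mean_fuse` that stands in for the fusion transformer are all hypothetical choices.

```python
import itertools

import numpy as np


def l2_normalize(x, eps=1e-12):
    """Row-wise unit normalisation."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE over a batch; row i of `a` matches row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def mean_fuse(feats):
    """Hypothetical stand-in for the fusion transformer: average the inputs."""
    return np.mean(np.stack(feats), axis=0)


def combinatorial_loss(features, fuse_fn):
    """Sum contrastive terms over all non-overlapping modality combinations.

    `features` maps a modality name to a (batch, dim) array. Combinations
    are single modalities and pairs; two combinations are contrasted only
    if they share no modality.
    """
    names = sorted(features)
    combos = [c for r in (1, 2) for c in itertools.combinations(names, r)]
    embeddings = {c: fuse_fn([features[m] for m in c]) for c in combos}

    total, n_terms = 0.0, 0
    for x, y in itertools.combinations(combos, 2):
        if set(x) & set(y):
            continue  # skip overlapping combinations, e.g. video vs (video, text)
        total += info_nce(embeddings[x], embeddings[y])
        n_terms += 1
    return total, n_terms


rng = np.random.default_rng(0)
batch, d = 8, 32
features = {m: rng.normal(size=(batch, d)) for m in ("video", "audio", "text")}
loss, n_terms = combinatorial_loss(features, mean_fuse)
print(n_terms, float(loss))
```

With three modalities, the non-overlap rule yields three single-versus-single terms and three single-versus-pair terms. Because the fusion function accepts a list of any length, the same code path handles one, two, or more modalities at test time.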

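This paper proposes a multi-modal, modality-agnostic fusion transformer approach that exchanges information between multiple modalities and integrates them into a joint multi-modal representation, yielding an embedding that aggregates multi-modal temporal information and can be used for zero-shot retrieval and classification. We train the model on the HowTo100M dataset and evaluate it on four challenging benchmark datasets, achieving state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.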