Multi-modal semantic understanding requires integrating information from
different modalities to extract the real intention behind users' words. Most
previous work applies a dual-encoder structure to encode image and text
separately, but fails to learn cross-modal feature alignment, making deep
cross-modal information interaction hard to achieve. This paper proposes a novel
CLIP-guided contrastive-learning-based architecture to perform multi-modal
feature alignment, which projects the features derived from different
modalities into a unified deep space. On multi-modal sarcasm detection (MMSD)
and multi-modal sentiment analysis (MMSA) tasks, the experimental results show
that our proposed model significantly outperforms several baselines, and our
feature alignment strategy brings a clear performance gain over models with
different aggregation methods and even over models enriched with knowledge. More
importantly, our model is simple to implement without using task-specific
external knowledge, and thus can be easily migrated to other multi-modal tasks.
Our source code is available at this https URL
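
As a rough illustration of contrastive feature alignment in a shared space, the
sketch below computes a generic CLIP-style symmetric InfoNCE loss over a batch
of already-projected image and text features. The abstract does not state the
exact objective, so the function names, the temperature value, and the
assumption that row i of each list is a matched pair are all illustrative
choices, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Softmax cross-entropy of one row against the index of the true match."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def symmetric_info_nce(image_feats, text_feats, temperature=0.07):
    """Mean of image-to-text and text-to-image InfoNCE over a batch.

    Assumption: image_feats[i] and text_feats[i] form a matched pair and
    both already live in the same projected space.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, scaled by temperature
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    i2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy_row([sim[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

The loss is small when each image is most similar to its own text and large
when pairs are mismatched, which is the behaviour an alignment objective of
this kind relies on.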

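This study proposes a CLIP-guided, contrastive-learning-based architecture for multi-modal feature alignment, which projects features from different modalities into a unified deep space. Experimental results show that the proposed model clearly outperforms several baselines on multi-modal sarcasm detection and multi-modal sentiment analysis, and that the feature alignment strategy brings a clear performance gain over other aggregation methods and even over knowledge-enriched models. In addition, the model is simple to implement and needs no task-specific external knowledge, so it can be easily migrated to other multi-modal tasks.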

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

Zero-shot classification of image scenes, which can recognize image scenes
that are not seen in the training stage, holds great promise for lowering the
dependence on large numbers of labeled samples. To address zero-shot image
scene classification, cross-modal feature alignment methods have been
proposed in recent years. These methods mainly focus on matching the visual
features of each image scene with its corresponding semantic descriptors in
the latent space. Less attention has been paid to the contrastive relationships
between different image scenes and different semantic descriptors. Given the
large intra-class differences and inter-class similarity among image scenes,
and the potential for noisy samples, these methods are susceptible to the
influence of instances that lie far from those of the same class and close to
those of other classes. In this work, we propose a multi-level
cross-modal feature alignment method via contrastive learning for zero-shot
classification of remote sensing image scenes. While promoting the
single-instance level positive alignment between each image scene and its
corresponding semantic descriptors, the proposed method takes the
cross-instance contrastive relationships into consideration, and learns to keep
the visual and semantic features of different classes in the latent space apart
from each other. Extensive experiments have been conducted to evaluate the
performance of the proposed method. The results show that our proposed method
outperforms state-of-the-art methods for zero-shot remote sensing image scene
classification. All the code and data are available on GitHub at
this https URL
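
To make the two levels concrete, here is a minimal sketch of a per-sample
objective that adds a single-instance alignment term to a cross-instance
contrastive term. The abstract does not give the loss, so the function name
`multi_level_loss`, the `weight` and `temperature` parameters, and the choice
of cosine similarity are hypothetical; the sketch also contrasts one visual
feature against class descriptors only, which is a simplification.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def multi_level_loss(visual, label, class_descriptors,
                     temperature=0.1, weight=1.0):
    """Hypothetical two-term objective for one image scene.

    visual: visual feature of the scene in the latent space
    label: index of the scene's class in class_descriptors
    class_descriptors: one semantic vector per class, same latent space
    """
    sims = [cosine(visual, d) for d in class_descriptors]
    # Single-instance level: pull the scene toward its own descriptor.
    positive_alignment = 1.0 - sims[label]
    # Cross-instance level: a softmax over all classes pushes the scene
    # away from the descriptors of the other classes.
    logits = [s / temperature for s in sims]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    contrastive = log_sum_exp - logits[label]
    return positive_alignment + weight * contrastive
```

With this form, a scene that sits close to another class's descriptor is
penalised by both terms, which matches the motivation given above for handling
instances that drift toward other classes.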

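This paper proposes a multi-level cross-modal feature alignment method via contrastive learning for zero-shot classification of remote sensing image scenes. Experimental results show that the method outperforms existing zero-shot remote sensing image scene classification methods.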

Multi-level Cross-modal Feature Alignment via Contrastive Learning towards Zero-shot Classification of Remote Sensing Image Scenes

Building a universal Video-Language model for solving various video
understanding tasks (e.g., text-video retrieval, video question
answering) is an open challenge for the machine learning field. Towards this
goal, most recent works build the model by stacking uni-modal and cross-modal
feature encoders and train it with pair-wise contrastive pretext tasks. Though
offering attractive generality, the resulting models have to compromise between
efficiency and performance. They mostly adopt different architectures to deal
with different downstream tasks. We find this is because pair-wise training
cannot align and fuse features from different modalities well. We
therefore introduce Clover, a Correlated Video-Language
pre-training method, as a step towards a universal Video-Language model that
solves multiple video understanding tasks without compromising performance or
efficiency. It improves cross-modal feature alignment and fusion via
a novel tri-modal alignment pre-training task. Additionally, we propose to
enhance the tri-modal alignment by incorporating learning from semantic masked
samples and a new pair-wise ranking loss. Clover establishes new
state-of-the-art results on multiple downstream tasks, including three
retrieval tasks for both zero-shot and fine-tuning settings, and eight video
question answering tasks. Code and pre-trained models will be released at
https://github.com/LeeYN-43/Clover.
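
The abstract does not spell out the form of the pair-wise ranking loss. As a
point of reference only, the sketch below shows a generic margin-based ranking
loss of the kind commonly used for retrieval; the function name, the `margin`
default, and the idea of scoring one positive pair against several negatives
are assumptions, not Clover's actual formulation.

```python
def pairwise_ranking_loss(pos_score, neg_scores, margin=0.2):
    """Generic hinge-style ranking loss (illustrative, not Clover's loss).

    Each negative pair should score at least `margin` below the positive
    pair; only negatives that violate this contribute to the loss.
    """
    violations = [max(0.0, margin - pos_score + s) for s in neg_scores]
    return sum(violations) / len(violations)
```

For example, a positive pair scored well above every negative yields zero
loss, while a negative that outranks the positive yields a loss that grows
with the size of the violation.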

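This paper proposes Clover, which improves cross-modal feature alignment and fusion through a novel tri-modal alignment pre-training task, and further strengthens the tri-modal alignment by learning from semantic masked samples and a new pair-wise ranking loss. Clover achieves new state-of-the-art results on multiple downstream tasks, including three retrieval tasks under zero-shot and fine-tuning settings and eight video question answering tasks.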