In this paper we describe our work towards building a generic framework for
both multi-modal embedding and multi-label binary classification tasks, while
participating in Task 5 (Multimedia Automatic Misogyny Identification) of the
SemEval 2022 competition.
Since pretraining deep models from scratch is a resource- and data-hungry
task, our approach is based on three main strategies. We combine different
state-of-the-art architectures to capture a wide spectrum of semantic signals
from the multi-modal input. We employ a multi-task learning scheme to be able
to use multiple datasets from the same knowledge domain to help increase the
model's performance. We also use multiple objectives to regularize and
fine-tune different system components.
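
As a rough illustration of the ingredients named above (fusing per-modality embeddings, predicting several binary labels at once, and combining several objectives), here is a minimal sketch in plain Python. It is not the authors' system: the concatenation-based fusion, the linear head, and the names `fuse`, `predict`, `bce`, and `multi_task_loss` are all assumptions made for illustration, and the abstract does not say how the objectives are weighted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(text_emb, image_emb):
    """Assumed fusion: concatenate the text and image embeddings."""
    return list(text_emb) + list(image_emb)


def predict(weights, biases, features):
    """One independent sigmoid output per label (multi-label binary classification)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]


def bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the labels; eps guards against log(0)."""
    total = sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    )
    return -total / len(targets)


def multi_task_loss(task_losses, task_weights):
    """Assumed combination rule: weighted sum of the per-task losses."""
    return sum(w * loss for w, loss in zip(task_weights, task_losses))


# Hypothetical toy inputs: 2-d text and image embeddings, two binary labels.
features = fuse([0.2, -0.1], [0.5, 0.3])
probs = predict([[0.4, 0.0, 0.1, 0.0], [0.0, 0.3, 0.0, -0.2]], [0.0, 0.1], features)
loss = multi_task_loss([bce(probs, [1, 0])], [1.0])
```

Each label gets its own sigmoid, so the labels are not mutually exclusive, which is what separates multi-label binary classification from a single softmax over classes.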

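This paper describes the authors' work on building a generic framework for multi-modal embedding and multi-label binary classification while participating in Task 5 (Multimedia Automatic Misogyny Identification) of the SemEval 2022 competition. To avoid the resource and data cost of pretraining deep models from scratch, the authors adopt three main strategies: combining different state-of-the-art architectures to capture a wide spectrum of semantic signals from the multi-modal input, using a multi-task learning scheme to exploit multiple datasets from the same domain to improve model performance, and using multiple objectives to regularize and fine-tune different system components.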

SemEval-2022 Task 5: Codec, a multi-modal multi-transformer framework for misogynous meme classification

Codec at SemEval-2022 Task 5: Multi-Modal Multi-Transformer Misogynous Meme Classification Framework

Pre-trained large-scale models provide a transferable embedding, and they
show promising performance on diverse downstream tasks. However, the learned
embeddings have not been analyzed in depth, and their transferability to
cross-modal tasks can be improved. This paper provides a perspective for
understanding multi-modal embedding in terms of uniformity and alignment. We
find that the representation learned by multi-modal learning models such as
CLIP has two separate embedding spaces, one for each heterogeneous dataset,
with poor alignment. Besides, there are large unexplored intermediate areas
between the two modalities, with poor uniformity. As a result, the lack of
alignment and uniformity might restrict the robustness and transferability of
the
representation for the downstream task. To this end, we provide a new
end-to-end fine-tuning method for robust representation that encourages better
uniformity and alignment scores. First, we propose Geodesic Multi-Modal Mixup,
which mixes the representations of image and text to generate hard negative
samples on the hyperspherical embedding space. Second, we fine-tune the
multi-modal model on the hard negative samples as well as normal negative and
positive samples with a contrastive loss. Through extensive experiments on
retrieval, classification, and structure-awareness tasks, we demonstrate that
our Geodesic Multi-Modal Mixup learns a robust representation and provides
improved performance on various downstream tasks.
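
To make the idea of mixing on the hypersphere concrete, the sketch below interpolates between a unit-norm image embedding and a unit-norm text embedding along the great-circle arc (spherical linear interpolation), so the mixed point stays on the unit sphere. It also includes one common way to score alignment and uniformity. The abstract gives no formulas, so everything here is an assumption: the function names, the convention that `lam=1` returns the first vector, and the scoring definitions (mean squared distance between matched pairs for alignment; log of the mean Gaussian potential over all pairs, with `t=2`, for uniformity).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def geodesic_mixup(a, b, lam):
    """Mix unit vectors a and b along the geodesic; lam=1 gives a, lam=0 gives b."""
    theta = math.acos(max(-1.0, min(1.0, dot(a, b))))
    s = math.sin(theta)
    if s < 1e-8:
        # Identical or antipodal inputs: the arc is degenerate or not unique,
        # so this sketch simply falls back to a.
        return list(a)
    wa = math.sin(lam * theta) / s
    wb = math.sin((1.0 - lam) * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(xs, ys):
    """Mean squared distance between matched pairs (lower is better aligned)."""
    return sum(sq_dist(x, y) for x, y in zip(xs, ys)) / len(xs)


def uniformity(xs, t=2.0):
    """Log of the mean pairwise Gaussian potential (lower is more uniform)."""
    pairs = [(i, j) for i in range(len(xs)) for j in range(i + 1, len(xs))]
    mean = sum(math.exp(-t * sq_dist(xs[i], xs[j])) for i, j in pairs) / len(pairs)
    return math.log(mean)


# Hypothetical toy embeddings for one image-text pair.
image_emb = normalize([1.0, 0.2, 0.0])
text_emb = normalize([0.1, 1.0, 0.3])
hard_negative = geodesic_mixup(image_emb, text_emb, 0.5)
```

A plain linear mix `lam * a + (1 - lam) * b` would land inside the sphere and need renormalizing. The geodesic form keeps unit norm by construction, which matters when the contrastive loss compares embeddings by cosine similarity.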

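This study offers a perspective for understanding multi-modal embeddings and proposes a new end-to-end fine-tuning method that encourages better uniformity and alignment scores. Extensive experiments on retrieval, classification, and structure-awareness tasks show that the geodesic multi-modal Mixup learns a robust representation and improves performance on various downstream tasks.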

MIT proposes a new multi-modal mixup method, Geodesic Multi-Modal Mixup, for robust fine-tuning

Geodesic Multi-Modal Mixup for Robust Fine-Tuning

Several works have proposed to learn a two-path neural network that maps
images and texts, respectively, to the same shared Euclidean space where geometry
captures useful semantic relationships. Such a multi-modal embedding can be
trained and used for various tasks, notably image captioning. In the present
work, we introduce a new architecture of this type, with a visual path that
leverages recent space-aware pooling mechanisms. Combined with a textual path
which is jointly trained from scratch, our semantic-visual embedding offers a
versatile model. Once trained under the supervision of captioned images, it
yields new state-of-the-art performance on cross-modal retrieval. It also
allows the localization of new concepts from the embedding space into any input
image, delivering state-of-the-art results on the visual grounding of phrases.
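
Once images and texts live in one shared space, cross-modal retrieval reduces to ranking candidates by similarity to a query. The sketch below shows that step only, on precomputed embeddings. It is an illustration, not the paper's pipeline: the use of cosine similarity, the function names, and the convention that caption `i` matches image `i` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def retrieve(query, candidates, k=1):
    """Indices of the k candidates most similar to the query, best first."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return order[:k]


def recall_at_k(image_embs, text_embs, k=1):
    """Fraction of images whose matching caption (same index) is in the top k."""
    hits = sum(
        1 for i, img in enumerate(image_embs) if i in retrieve(img, text_embs, k)
    )
    return hits / len(image_embs)


# Hypothetical toy embeddings: caption i is assumed to describe image i.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]
score = recall_at_k(image_embs, text_embs, k=1)
```

Swapping the roles of `image_embs` and `text_embs` gives the text-to-image direction with the same code.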

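This study proposes a new two-path neural network in which the visual path uses recent space-aware pooling mechanisms; combined with a textual path trained from scratch, it yields a multi-modal embedding. Once trained on captioned images, the model achieves new state-of-the-art performance on cross-modal retrieval and on the visual grounding of phrases.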