Multimodal representation learning has shown promising improvements on
various vision-language tasks. Most existing methods excel at building
global-level alignment between vision and language while lacking effective
fine-grained image-text interaction. In this paper, we propose a jointly masked
multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both
implicit and explicit targets for the masked signals to recover. The implicit
target provides a unified and debiased objective for vision and language, where
the model predicts latent multimodal representations of the unmasked input. The
explicit target further enriches the multimodal representations by recovering
high-level and semantically meaningful information: momentum visual features of
image patches and concepts of word tokens. Through such a masked modeling
process, our model not only learns fine-grained multimodal interaction, but
also avoids the semantic gap between high-level representations and low- or
mid-level prediction targets (e.g., image pixels), thus producing semantically
rich multimodal representations that perform well in both zero-shot and
fine-tuned settings. Our pre-trained model (named MAMO) achieves
state-of-the-art performance on various downstream vision-language tasks,
including image-text retrieval, visual question answering, visual reasoning,
and weakly-supervised visual grounding.
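
The input side of the joint masking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `[MASK]` placeholder, and the masking ratios (0.75 for image patches, 0.25 for text tokens) are assumptions, since the abstract does not specify them. The implicit and explicit prediction targets are not shown.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked position


def mask_sequence(items, ratio, rng):
    """Replace a random subset of positions with MASK; return the result and masked indices."""
    n_mask = max(1, round(len(items) * ratio))
    masked_idx = set(rng.sample(range(len(items)), n_mask))
    masked = [MASK if i in masked_idx else item for i, item in enumerate(items)]
    return masked, sorted(masked_idx)


def joint_mask(patches, tokens, image_ratio=0.75, text_ratio=0.25, seed=0):
    """Mask image patches and text tokens of the same pair (ratios are assumed values)."""
    rng = random.Random(seed)
    masked_patches, patch_idx = mask_sequence(patches, image_ratio, rng)
    masked_tokens, token_idx = mask_sequence(tokens, text_ratio, rng)
    return masked_patches, patch_idx, masked_tokens, token_idx


# Toy image-text pair: 8 patch identifiers and 4 word tokens.
patches = [f"p{i}" for i in range(8)]
tokens = ["a", "dog", "on", "grass"]
masked_patches, patch_idx, masked_tokens, token_idx = joint_mask(patches, tokens)
print(masked_patches, masked_tokens)
```

A model trained this way would receive both masked sequences together and be asked to recover the signals at `patch_idx` and `token_idx`, so that each modality can help reconstruct the other.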

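This paper proposes MAMO, a jointly masked multimodal modeling method that learns fine-grained multimodal representations by jointly masking image-text input and recovering the masked signals through implicit and explicit targets. By recovering high-level, semantically meaningful information, it achieves state-of-the-art results on various downstream vision-language tasks.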

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Unsupervised large-scale vision-language pre-training has shown promising
advances on various downstream tasks. Existing methods often model
cross-modal interaction either via the similarity of the global features of each
modality, which lacks sufficient information, or via finer-grained
interactions using cross/self-attention over visual and textual tokens. However,
cross/self-attention suffers from inferior efficiency in both training and
inference. In this paper, we introduce a large-scale Fine-grained Interactive
Language-Image Pre-training (FILIP) to achieve finer-level alignment through a
cross-modal late interaction mechanism, which uses a token-wise maximum
similarity between visual and textual tokens to guide the contrastive
objective. FILIP successfully leverages the finer-grained expressiveness
between image patches and textual words by modifying only the contrastive loss,
while simultaneously gaining the ability to pre-compute image and text
representations offline at inference, keeping both large-scale training and
inference efficient. Furthermore, we construct a new large-scale image-text
pair dataset called FILIP300M for pre-training. Experiments show that FILIP
achieves state-of-the-art performance on multiple downstream vision-language
tasks including zero-shot image classification and image-text retrieval. The
visualization on word-patch alignment further shows that FILIP can learn
meaningful fine-grained features with promising localization ability.
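
The token-wise maximum similarity mentioned above can be illustrated with a small sketch. It assumes cosine similarity between non-zero token embeddings and averages, over the tokens of one modality, the maximum similarity to any token of the other modality. The function names and toy vectors are hypothetical, and the contrastive loss built on top of these scores is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def late_interaction_similarity(query_tokens, key_tokens):
    """Mean over query tokens of the max cosine similarity to any key token."""
    q = [l2_normalize(t) for t in query_tokens]
    k = [l2_normalize(t) for t in key_tokens]
    return sum(max(dot(qt, kt) for kt in k) for qt in q) / len(q)


# Toy embeddings: 3 image patches and 2 text tokens in a 2-d space.
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 2.0]]

sim_i2t = late_interaction_similarity(image_patches, text_tokens)
sim_t2i = late_interaction_similarity(text_tokens, image_patches)
print(sim_i2t, sim_t2i)
```

The two directions generally differ, since each one takes the maximum over a different set of tokens. Because the score depends only on per-token embeddings, image and text representations can be computed separately and stored ahead of time, which is the efficiency property the abstract highlights.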

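This paper introduces FILIP, a large-scale Fine-grained Interactive Language-Image Pre-training method that achieves finer-level alignment through a cross-modal late interaction mechanism, and constructs a new large-scale image-text pair dataset for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple vision-language tasks, including zero-shot image classification and image-text retrieval.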

FILIP: Fine-grained Interactive Language-Image Pre-Training

Recent advances regarding question answering and reading comprehension have
resulted in models that surpass human performance when the answer is contained
in a single, continuous passage of text, requiring only single-hop reasoning.
However, in realistic scenarios, many complex queries require multi-hop
reasoning. The key to the question answering task is semantic feature
interaction between documents and questions, which is commonly handled by
Bi-directional Attention Flow (Bi-DAF). Yet Bi-DAF generally captures only the
surface semantics of words in complex questions and fails to capture the
implied semantic features of intermediate answers. As a result, Bi-DAF ignores
part of the context related to the question and cannot extract the most
important parts of multiple documents. In this paper, we propose a new model
architecture for multi-hop question answering that applies two completion
strategies: (1) a Coarse-Grain complex question Decomposition (CGDe) strategy,
which decomposes a complex question into simple ones without any additional
annotations, and (2) a Fine-Grained Interaction (FGIn) strategy, which better
represents each word in the document and extracts more comprehensive and
accurate sentences related to the inference path. The two strategies are
combined and tested on the SQuAD and
HotpotQA datasets, and the experimental results show that our method
outperforms state-of-the-art baselines.

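This paper proposes a new model architecture for multi-hop question answering that applies two strategies, CGDe and FGIn, and outperforms state-of-the-art baselines on the SQuAD and HotpotQA datasets.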