Most existing methods in vision language pre-training rely on object-centric
features extracted through object detection and make fine-grained alignments
between the extracted features and texts. It is challenging for these methods
to learn relations among multiple objects. To this end, we propose a new method
called X-VLM to perform `multi-grained vision language pre-training.' The key
to learning multi-grained alignments is to locate visual concepts in the image
given the associated texts, and at the same time align the texts with the visual
concepts, with alignments at multiple granularities. Experimental results
show that X-VLM effectively leverages the learned multi-grained alignments in
many downstream vision language tasks and consistently outperforms
state-of-the-art methods.
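
The two ingredients named above, locating a visual concept and aligning it with text, can be illustrated with a small sketch. The abstract does not specify X-VLM's training objectives, so everything here is an assumption made for illustration: `box_iou` stands in for a localization score between a predicted and a reference box, and `contrastive_loss` is a generic symmetric InfoNCE-style loss over a concept-text similarity matrix. Neither function name nor the temperature value comes from the paper.

```python
import math


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Hypothetical stand-in for a localization score; not taken from the paper.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE-style loss over a square similarity matrix.

    sim[i][j] scores visual concept i against text j; matched pairs are
    assumed to sit on the diagonal. The temperature is an illustrative value.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    columns = [list(c) for c in zip(*sim)]  # text-to-concept direction
    return 0.5 * (one_direction(sim) + one_direction(columns))
```

Under these assumptions, a well-aligned similarity matrix (large diagonal) yields a lower loss than a misaligned one, and a predicted box that overlaps its reference more closely yields a higher IoU.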

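Proposes X-VLM, a multi-grained vision language pre-training method that locates visual concepts in an image and aligns them with the associated texts at multiple granularities; applied to downstream vision language tasks, it achieves strong results and outperforms existing state-of-the-art methods.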

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

It is common practice for recent work in vision language cross-modal
reasoning to adopt a binary or multi-choice classification formulation that takes
as input a set of source image(s) and a textual query. In this work, we take a
sober look at such an unconditional formulation in the sense that no prior
knowledge is specified with respect to the source image(s). Inspired by the
designs of both visual commonsense reasoning and natural language inference
tasks, we propose a new task termed Premise-based Multi-modal Reasoning (PMR)
where a textual premise is the background presumption on each source image. The
PMR dataset contains 15,360 manually annotated samples which are created by a
multi-phase crowd-sourcing process. With selected high-quality movie
screenshots and human-curated premise templates from 6 pre-defined categories,
we ask crowd-source workers to write one true hypothesis and three distractors
(4 choices) given the premise and image through a cross-check procedure.
In addition, we generate adversarial samples to alleviate the annotation artifacts
and double the size of PMR. We benchmark various state-of-the-art (pretrained)
multi-modal inference models on PMR and conduct comprehensive experimental
analyses to showcase the utility of our dataset.
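
The sample format described above (one image, one textual premise, four candidate hypotheses of which exactly one is true) can be sketched as a small data structure with an accuracy metric. This is a minimal sketch, not the dataset's actual schema: the class name `PMRSample`, its field names, and the `accuracy` helper are all hypothetical. With four choices per sample, uniform random guessing gives an expected accuracy of 25%.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PMRSample:
    """One multiple-choice item; all field names are hypothetical."""

    image_id: str           # identifier of the source image
    premise: str            # textual background presumption on the image
    choices: Sequence[str]  # one true hypothesis plus three distractors
    label: int              # index of the true hypothesis in `choices`

    def __post_init__(self):
        if len(self.choices) != 4:
            raise ValueError("expected exactly 4 choices")
        if not 0 <= self.label < 4:
            raise ValueError("label must index one of the 4 choices")


def accuracy(samples, predictions):
    """Fraction of samples whose predicted choice index matches the label."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions differ in length")
    if not samples:
        return 0.0
    correct = sum(1 for s, p in zip(samples, predictions) if s.label == p)
    return correct / len(samples)
```

A model under evaluation would map each `(image_id, premise, choices)` triple to a predicted index, and `accuracy` would then compare those indices against the annotated labels.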

This paper proposes a premise-based multi-modal reasoning task and builds the PMR dataset to evaluate the performance of multi-modal reasoning models.

Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues

This paper presents a detailed study of improving visual representations for
vision language (VL) tasks and develops an improved object detection model to
provide object-centric representations of images. Compared to the most widely
used \emph{bottom-up and top-down} model \cite{anderson2018bottom}, the new
model is bigger, better-designed for VL tasks, and pre-trained on much larger
training corpora that combine multiple public annotated object detection
datasets. Therefore, it can generate representations of a richer collection of
visual objects and concepts. While previous VL research focuses mainly on
improving the vision-language fusion model and leaves the object detection
model improvement untouched, we show that visual features matter significantly
in VL models. In our experiments we feed the visual features generated by the
new object detection model into a Transformer-based VL fusion model OSCAR
\cite{li2020oscar}, and utilize an improved approach OSCAR+ to pre-train the
VL model and fine-tune it on a wide range of downstream VL tasks. Our results
show that the new visual features significantly improve the performance across
all VL tasks, creating new state-of-the-art results on seven public benchmarks.
We will release the new object detection model to the public.

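By proposing an improved object detection model that produces object-centric representations covering a richer set of visual objects and concepts, this paper significantly improves performance on vision language tasks and sets new state-of-the-art results on seven public benchmarks.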