Large language models (LLMs) have increased interest in vision language
models (VLMs), which process image-text pairs as input. Studies of the visual
understanding ability of VLMs have been conducted, but such studies
are still preliminary because existing datasets do not permit a comprehensive
evaluation of the fine-grained visual linguistic abilities of VLMs across
multiple languages. To further explore the strengths of VLMs such as GPT-4V
\cite{openai2023GPT4}, we developed new datasets for the systematic and
qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced
nine vision-and-language (VL) tasks (including object recognition, image-text
matching, and more) and constructed multilingual visual-text datasets in four
languages (English, Japanese, Swahili, and Urdu) by using templates
containing \textit{questions} and prompting GPT-4V to generate the
\textit{answers} and the \textit{rationales}; 2) we introduced a new VL task
named \textit{unrelatedness}; 3) we introduced rationales to enable human
understanding of the VLM reasoning process; and 4) we employed human evaluation
to measure the suitability of the proposed datasets for VL tasks. We show that
VLMs can be fine-tuned on our datasets. Our work is the first to conduct such
analyses in Swahili and Urdu. It also introduces \textit{rationales} in VL
analysis, which played a vital role in the evaluation.

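Using templates, we built multilingual visual-text datasets in four languages, introduced nine vision-language tasks, and introduced rationales as an explanation mechanism for evaluating the performance of large language models on vision-language tasks.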

Constructing multilingual visual-text datasets to reveal the multilingual ability of vision language models

Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Highlighting particularly relevant regions of an image can improve the
performance of vision-language models (VLMs) on various vision-language (VL)
tasks by guiding the model to attend more closely to these regions of interest.
For example, VLMs can be given a "visual prompt", where visual markers such as
bounding boxes delineate key image regions. However, current VLMs that can
incorporate visual guidance are either proprietary and expensive or require
costly training on curated data that includes visual prompts. We introduce
Contrastive Region Guidance (CRG), a training-free guidance method that enables
open-source VLMs to respond to visual prompts. CRG contrasts model outputs
produced with and without visual prompts, factoring out biases revealed by the
model when answering without the information required to produce a correct
answer (i.e., the model's prior). CRG achieves substantial improvements in a
wide variety of VL tasks: When region annotations are provided, CRG increases
absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse
region-based tasks such as recognition, math, and object relationship
reasoning. We also show CRG's applicability to spatial reasoning, with 10%
improvement on What'sUp, as well as to compositional generalization --
improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe
-- and to image-text alignment for generated images, where we improve by up to
8.4 AUROC and 6.8 F1 points on SeeTRUE. When reference regions are absent, CRG
allows us to re-rank proposed regions in referring expression comprehension and
phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an
average gain of 3.2% in accuracy. Our analysis explores alternative masking
strategies for CRG, quantifies CRG's probability shift, and evaluates the role
of region guidance strength, empirically validating CRG's design choices.
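
The contrast between outputs produced with and without the region's evidence can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the combination below, which extrapolates away from the "without" logits in the style of classifier-free guidance, is an assumption, as are the helper name `contrast_region_logits`, the strength parameter `alpha`, and the toy scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_region_logits(logits_with_region, logits_without_region, alpha=1.0):
    """Contrast answer logits from two forward passes (hypothetical helper).

    logits_with_region:    scores when the model can see the region of interest
    logits_without_region: scores when that region's evidence is withheld,
                           which approximates the model's prior
    alpha:                 guidance strength (assumed knob; 0 disables guidance)
    """
    guided = [
        (1.0 + alpha) * w - alpha * wo
        for w, wo in zip(logits_with_region, logits_without_region)
    ]
    return softmax(guided)


# Toy scores for two candidate answers (made-up numbers, not from the paper).
with_region = [2.0, 1.8]
without_region = [2.0, 0.5]  # answer 0 is favoured even without the evidence

plain = softmax(with_region)
guided = contrast_region_logits(with_region, without_region, alpha=1.0)
```

In this toy case the plain distribution prefers answer 0, which the model also prefers when the evidence is withheld, while the contrasted distribution prefers answer 1, whose score depends on the region. Larger values of `alpha` push further away from the prior.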

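By guiding the model with visual cues, the Contrastive Region Guidance (CRG) method improves the performance of vision-language models (VLMs) on a variety of vision-language tasks, reducing model bias and improving accuracy.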

Contrastive region guidance: improving grounding in vision-language models without training

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

The attention mechanism is an important component across Vision-and-Language
(VL) tasks, where it is used to bridge the semantic gap between visual and
textual features. Although attention has been widely used in VL tasks, the
capability of different attention alignment calculations to bridge the
semantic gap between visual and textual clues has not been examined. In this
research, we conduct a comprehensive analysis of the role of attention
alignment by looking into the attention score calculation methods and checking
how well they represent the significance of visual regions and textual tokens
for the global assessment. We also analyse the conditions under which an
attention score calculation mechanism is more (or less) interpretable, and
which may impact model performance, on three different VL tasks: visual
question answering, text-to-image generation, and text-and-image matching
(both sentence and image retrieval). Our analysis is the first of its kind and
provides useful insights into the importance of each attention alignment score
calculation when applied at the training phase of VL tasks, which is commonly
ignored in attention-based cross-modal models and/or pretrained models. Our
code is available at: this https URL
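
To make "attention score calculation methods" concrete, the sketch below compares a few common alignment scores between one textual query vector and several visual region vectors. The abstract does not list which scores the paper studies, so the four functions here (dot product, scaled dot product, cosine, and a bilinear "general" score) are illustrative assumptions, as are all function names and the toy vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dot_score(q, k):
    return dot(q, k)


def scaled_dot_score(q, k):
    # Dot product divided by sqrt(dimension), as in Transformer attention.
    return dot(q, k) / math.sqrt(len(k))


def cosine_score(q, k):
    # Ignores vector length, so only direction matters.
    return dot(q, k) / (math.sqrt(dot(q, q)) * math.sqrt(dot(k, k)))


def general_score(q, k, weight):
    # Bilinear form q^T W k; `weight` is a (len(q) x len(k)) matrix.
    return dot(q, [dot(row, k) for row in weight])


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, score_fn):
    """Normalise alignment scores between one query and every key."""
    return softmax([score_fn(query, k) for k in keys])


# Toy data: one text token and three image regions (made-up vectors).
text_token = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]

w_dot = attention_weights(text_token, regions, dot_score)
w_cos = attention_weights(text_token, regions, cosine_score)
```

With the dot product, the third region receives the largest weight because its vector is longer. With cosine similarity, the first and third regions tie because they point in the same direction. Differences of this kind are one reason the choice of score can change which regions appear significant.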

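This paper comprehensively analyses the role of different attention calculation methods in building semantic links between visual and textual features, as well as the relationship between the interpretability of the calculation mechanism and model performance. The results show that different calculation mechanisms perform differently across VL tasks, which offers insights for training attention mechanisms in VL tasks and has implications for building cross-modal models and pretrained models.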

Understanding attention in vision-and-language tasks

Understanding Attention for Vision-and-Language Tasks

Vision-language (VL) pre-training has recently received considerable
attention. However, most existing end-to-end pre-training approaches either
only aim to tackle VL tasks such as image-text retrieval, visual question
answering (VQA) and image captioning that test high-level understanding of
images, or only target region-level understanding for tasks such as phrase
grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based
transformER), a new VL model architecture that can seamlessly handle both these
types of tasks. Instead of having dedicated transformer layers for fusion after
the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by
inserting cross-attention into the image and text backbones, bringing gains in
terms of memory and performance. In addition, unlike previous work that is
either only pre-trained on image-text data or on fine-grained data with
box-level annotations, we present a two-stage pre-training strategy that uses
both these kinds of data efficiently: (i) coarse-grained pre-training based on
image-text data; followed by (ii) fine-grained pre-training based on
image-text-box data. We conduct comprehensive experiments on a wide range of VL
tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding,
referring expression comprehension, and object detection. Using deep multimodal
fusion coupled with the two-stage pre-training, FIBER provides consistent
performance improvements over strong baselines across all tasks, often
outperforming methods using orders of magnitude more data. Code is available at
this https URL.
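
The idea of inserting cross-attention into the uni-modal backbones, rather than stacking fusion layers after them, can be sketched roughly as follows. This is a simplified illustration and not the released implementation: attention here is single-head with no learned projections, both modalities are assumed to share one feature dimension, and the gating scalars `alpha_img` and `alpha_txt`, along with all function names, are assumptions made for the example.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(keys_values[0])
    outputs = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        )
        outputs.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return outputs


def add(x, y, scale=1.0):
    """Residual connection: x + scale * y, token by token."""
    return [[a + scale * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def fused_block(tokens, other_tokens, alpha):
    """One backbone block with cross-attention inserted after self-attention."""
    tokens = add(tokens, attend(tokens, tokens))               # within-modality
    tokens = add(tokens, attend(tokens, other_tokens), alpha)  # gated cross-modal
    return tokens


def fused_layer(image_tokens, text_tokens, alpha_img, alpha_txt):
    """Update both backbones at the same depth, each attending to the other."""
    new_image = fused_block(image_tokens, text_tokens, alpha_img)
    new_text = fused_block(text_tokens, image_tokens, alpha_txt)
    return new_image, new_text


# Toy inputs: two image tokens and three text tokens (made-up values).
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unfused_img, unfused_txt = fused_layer(image_tokens, text_tokens, 0.0, 0.0)
fused_img, fused_txt = fused_layer(image_tokens, text_tokens, 0.5, 0.5)
```

With both gates at zero, each block reduces to its uni-modal form, so the same backbone can serve with or without fusion. Non-zero gates mix information from the other modality at that depth.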

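FIBER is a new model architecture for vision-language (VL) tasks. By inserting cross-attention into the image and text backbones, it pushes multimodal fusion deep into the model and, together with a two-stage pre-training strategy, delivers consistent performance improvements on VL tasks.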