Object detection is the computer vision task of predicting a set of bounding
boxes and category labels for the objects of interest in a given image. Each
category corresponds to a linguistic symbol such as 'dog' or 'person', and there
should be relationships among these symbols. However, an object detector only
learns to classify the categories and does not treat them as linguistic symbols.
Multi-modal models often use a pre-trained object detector to extract object
features from the image, but the models are separate from the detector and the
extracted visual features do not change with their linguistic input. We
rethink object detection as a vision-and-language reasoning task. We then
propose the targeted detection task, where detection targets are given in
natural language and the goal is to detect all the target objects, and only
those, in a given image. Nothing is detected if no target is given. Commonly used
modern object detectors have many hand-designed components, such as anchors,
and it is difficult to fuse textual inputs into such a complex pipeline. We thus
propose the Language-Targeted Detector (LTD) for targeted detection, based on a
recently proposed Transformer-based detector. LTD is an encoder-decoder
architecture, and its conditional decoder allows the model to reason about the
encoded image with the textual input as the linguistic context. We evaluate
the detection performance of LTD on the COCO object detection dataset and also
show that our model improves the detection results by grounding the textual
input to the visual objects.
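
The targeted detection task described above can be illustrated with a minimal sketch of its expected output. This is not the LTD model: it only shows which objects count as correct detections for a given textual target. The annotation format (a label plus an `(x, y, width, height)` box) and the function name `targeted_ground_truth` are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x, y, width, height). Hypothetical, for illustration only.
Box = Tuple[float, float, float, float]
Annotation = Tuple[str, Box]


def targeted_ground_truth(
    annotations: List[Annotation], targets: List[str]
) -> List[Annotation]:
    """Keep only the annotated objects whose label is among the targets.

    With an empty target list the result is empty, matching the rule that
    nothing is detected when no target is given.
    """
    wanted = {t.lower() for t in targets}
    return [(label, box) for label, box in annotations if label.lower() in wanted]


# Hypothetical annotations for one image.
annotations = [
    ("dog", (10.0, 20.0, 50.0, 40.0)),
    ("person", (80.0, 15.0, 30.0, 90.0)),
    ("dog", (120.0, 60.0, 45.0, 35.0)),
]

dogs_only = targeted_ground_truth(annotations, ["dog"])
no_target = targeted_ground_truth(annotations, [])
```

Here `dogs_only` holds both dog boxes and omits the person, while `no_target` is empty. A standard detector, by contrast, would be expected to return all three objects regardless of any text.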

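This paper explores an approach that recasts object detection as a vision-and-language reasoning task and proposes the Language-Targeted Detector (LTD), built on a Transformer-based encoder-decoder architecture. The method reasons with the textual input as linguistic context, extending the classification function of existing object detectors. An evaluation of detection performance on the COCO dataset shows that LTD not only improves object detection results but also, by grounding the textual input to visual objects, reasons better about the targeted detection task.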

Object Detection Based Only on Specified Linguistic Targets

Detect Only What You Specify: Object Detection with Linguistic Target

For vision-and-language reasoning tasks, both fully connectionist, end-to-end
methods and hybrid, neuro-symbolic methods have achieved high in-distribution
performance. In which out-of-distribution settings does each paradigm excel? We
investigate this question on both single-image and multi-image visual
question-answering through four types of generalization tests: a novel
segment-combine test for multi-image queries, contrast set, compositional
generalization, and cross-benchmark transfer. Vision-and-language end-to-end
trained systems exhibit sizeable performance drops across all these tests.
Neuro-symbolic methods suffer even more on cross-benchmark transfer from GQA to
VQA, but they show smaller accuracy drops on the other generalization tests and
their performance quickly improves with few-shot training. Overall, our results
demonstrate the complementary benefits of these two paradigms, and emphasize
the importance of using a diverse suite of generalization tests to fully
characterize model robustness to distribution shift.

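This paper examines the relative strengths of neural-network-based connectionist methods and symbolic-logic methods in deep learning, focusing on performance under several kinds of generalization tests. The experiments show that each approach has its own advantages, and that using a variety of generalization tests gives a more complete assessment of model robustness and generality.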

Generalization differences between end-to-end and neuro-symbolic vision-language reasoning systems

Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems

Vision-and-language reasoning requires an understanding of visual concepts,
language semantics, and, most importantly, the alignment and relationships
between these two modalities. We thus propose the LXMERT (Learning
Cross-Modality Encoder Representations from Transformers) framework to learn
these vision-and-language connections. In LXMERT, we build a large-scale
Transformer model that consists of three encoders: an object relationship
encoder, a language encoder, and a cross-modality encoder. Next, to endow our
model with the capability of connecting vision and language semantics, we
pre-train the model with large amounts of image-and-sentence pairs, via five
diverse representative pre-training tasks: masked language modeling, masked
object prediction (feature regression and label classification), cross-modality
matching, and image question answering. These tasks help in learning both
intra-modality and cross-modality relationships. After fine-tuning from our
pre-trained parameters, our model achieves the state-of-the-art results on two
visual question answering datasets (i.e., VQA and GQA). We also show the
generalizability of our pre-trained cross-modality model by adapting it to a
challenging visual-reasoning task, NLVR2, and improve the previous best result
by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies
to prove that both our novel model components and pre-training strategies
significantly contribute to our strong results; and also present several
attention visualizations for the different encoders. Code and pre-trained
models are publicly available at: this https URL
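
The NLVR2 improvement quoted above is an absolute gain in accuracy, measured in percentage points. The short sketch below works through the arithmetic and separates it from the relative gain; the function names are hypothetical, and only the 54% and 76% figures come from the abstract.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Gain expressed as a fraction of the starting accuracy."""
    return (after - before) / before


# Accuracies quoted in the abstract for NLVR2 (previous best vs. LXMERT).
nlvr2_before, nlvr2_after = 54.0, 76.0

points = absolute_gain(nlvr2_before, nlvr2_after)
percent = round(relative_gain(nlvr2_before, nlvr2_after) * 100, 1)

print(points)   # 22.0 percentage points
print(percent)  # 40.7 percent relative to the previous best
```

Reading "22%" as a relative improvement would understate the result: 22 points on a 54% baseline is roughly a 40.7% relative gain.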

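This paper presents a method for vision-and-language reasoning using the LXMERT framework. The framework comprises a Transformer-based object relationship encoder, a language encoder, and a cross-modality encoder, and is pre-trained on a large number of image-and-sentence pairs to learn the relationships between the two modalities. After fine-tuning the pre-trained model, it achieves state-of-the-art results on two visual question answering datasets and improves the previous best result on the NLVR2 dataset by 22% absolute.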