Open vocabulary object detection has been greatly advanced by the recent
development of vision-language pretrained models, which help recognize novel
objects given only their semantic categories. Prior works mainly focus on
transferring knowledge to object proposal classification and employ
class-agnostic box and mask prediction. In this work, we propose CondHead, a
principled dynamic network design to better generalize box regression and
mask segmentation in the open vocabulary setting. The core idea is to
conditionally parameterize the network heads on the semantic embedding, so that
the model is guided by class-specific knowledge to better detect novel
categories. Specifically, CondHead is composed of two streams of network heads,
the dynamically aggregated head and the dynamically generated head. The former
is instantiated with a set of static heads that are conditionally aggregated;
these heads are optimized as experts and are expected to learn sophisticated
prediction. The latter is instantiated with dynamically generated parameters
and encodes general class-specific information. With such a conditional design,
the detection model is bridged by the semantic embedding to offer strongly
generalizable class-wise box and mask prediction. Our method brings significant
improvements to state-of-the-art open vocabulary object detection methods
with very minor overhead; for example, it surpasses a RegionCLIP model by 3.0
detection AP on novel categories with only 1.1% more computation.
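
The two-stream design described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `CondHeadSketch`, the linear gating layer, the linear weight generator, the layer sizes, and the choice to sum the two streams are all assumptions made for illustration. It only shows how a semantic embedding could (a) mix a set of static expert heads and (b) generate head parameters directly.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


class CondHeadSketch:
    """Toy conditional box-regression head (hypothetical, for illustration only)."""

    def __init__(self, embed_dim, feat_dim, out_dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Static expert heads: one (out_dim x feat_dim) weight matrix per expert.
        self.experts = rng.normal(0.0, 0.1, (num_experts, out_dim, feat_dim))
        # Assumed gating layer: semantic embedding -> one logit per expert.
        self.gate = rng.normal(0.0, 0.1, (num_experts, embed_dim))
        # Assumed weight generator: semantic embedding -> flattened head weight.
        self.generator = rng.normal(0.0, 0.1, (out_dim * feat_dim, embed_dim))
        self.out_dim, self.feat_dim = out_dim, feat_dim

    def aggregated_weight(self, class_embedding):
        """Dynamically aggregated head: embedding-weighted mix of the experts."""
        alpha = softmax(self.gate @ class_embedding)        # (num_experts,)
        return np.tensordot(alpha, self.experts, axes=1)    # (out_dim, feat_dim)

    def generated_weight(self, class_embedding):
        """Dynamically generated head: parameters produced from the embedding."""
        flat = self.generator @ class_embedding
        return flat.reshape(self.out_dim, self.feat_dim)

    def predict(self, roi_feature, class_embedding):
        """Class-conditioned prediction; summing the streams is an assumption."""
        w_agg = self.aggregated_weight(class_embedding)
        w_gen = self.generated_weight(class_embedding)
        return w_agg @ roi_feature + w_gen @ roi_feature


# Usage: the same region feature gives different box deltas per class embedding.
head = CondHeadSketch(embed_dim=8, feat_dim=16, out_dim=4, num_experts=3)
rng = np.random.default_rng(1)
roi_feature = rng.normal(size=16)
cat_embedding, dog_embedding = rng.normal(size=8), rng.normal(size=8)
cat_box = head.predict(roi_feature, cat_embedding)
dog_box = head.predict(roi_feature, dog_embedding)
```

Because both streams depend only on the class embedding, a novel category needs no new head parameters in this sketch: its text embedding alone selects the expert mixture and generates the second head.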

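This paper proposes CondHead, a dynamic network design that conditionally parameterizes the network heads on the semantic embedding, guiding the model with class-specific knowledge to better detect novel categories. The semantic embedding thereby lets the detection model offer strongly generalizable class-wise box and mask prediction, significantly improving open vocabulary object detection methods at very small overhead.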

Learning to Detect and Segment for Open Vocabulary Object Detection

We present a unified Vision-Language pretrained Model (VLMo) that jointly
learns a dual encoder and a fusion encoder with a modular Transformer network.
Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer,
where each block contains a pool of modality-specific experts and a shared
self-attention layer. Because of the modeling flexibility of MoME, pretrained
VLMo can be fine-tuned as a fusion encoder for vision-language classification
tasks, or used as a dual encoder for efficient image-text retrieval. Moreover,
we propose a stagewise pre-training strategy, which effectively leverages
large-scale image-only and text-only data in addition to image-text pairs.
Experimental results show that VLMo achieves state-of-the-art results on
various vision-language tasks, including VQA, NLVR2 and image-text retrieval.
The code and pretrained models are available at this https URL
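
The MoME block described above (a shared self-attention layer followed by a pool of modality-specific experts) can be sketched in NumPy as follows. This is a simplified illustration, not the released VLMo code: the class name `MoMEBlockSketch`, the expert keys (`"vision"`, `"language"`, `"vision_language"`), single-head attention, and the omission of layer normalization are all assumptions made to keep the example short.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MoMEBlockSketch:
    """Toy Transformer block with shared attention and per-modality FFN experts."""

    def __init__(self, dim, hidden_dim,
                 modalities=("vision", "language", "vision_language"), seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.1, shape)

        # Self-attention projections are shared by every modality.
        self.wq, self.wk, self.wv = init(dim, dim), init(dim, dim), init(dim, dim)
        # One two-layer feed-forward expert per modality (hypothetical keys).
        self.experts = {m: (init(dim, hidden_dim), init(hidden_dim, dim))
                        for m in modalities}

    def self_attention(self, x):
        """Single-head scaled dot-product attention over a (tokens, dim) input."""
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        scores = q @ k.T / np.sqrt(x.shape[-1])
        return softmax(scores) @ v

    def forward(self, x, modality):
        """Shared attention, then the feed-forward expert chosen by modality."""
        x = x + self.self_attention(x)              # residual around attention
        w1, w2 = self.experts[modality]
        return x + np.maximum(x @ w1, 0.0) @ w2     # residual around ReLU FFN


# Usage: the same tokens take different expert paths depending on the modality.
block = MoMEBlockSketch(dim=8, hidden_dim=16)
tokens = np.random.default_rng(1).normal(size=(5, 8))
as_vision = block.forward(tokens, "vision")
as_language = block.forward(tokens, "language")
```

Routing here is by a fixed modality label rather than a learned gate, which is what lets the same stack act as a dual encoder (image and text processed separately) or as a fusion encoder (concatenated tokens sent through the vision-language expert).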

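This work proposes a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks.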