Multi-dataset training provides a viable solution for exploiting
heterogeneous large-scale datasets without extra annotation cost. In this work,
we propose a scalable multi-dataset detector (ScaleDet) that can scale up its
generalization across datasets when increasing the number of training datasets.
Unlike existing multi-dataset learners that mostly rely on manual relabelling
efforts or sophisticated optimizations to unify labels across datasets, we
introduce a simple yet scalable formulation to derive a unified semantic label
space for multi-dataset training. ScaleDet is trained by visual-textual
alignment to learn the label assignment with label semantic similarities across
datasets. Once trained, ScaleDet can generalize well to any given upstream or
downstream dataset with seen and unseen classes. We conduct extensive
experiments using LVIS, COCO, Objects365, and OpenImages as upstream datasets, and
13 datasets from Object Detection in the Wild (ODinW) as downstream datasets.
Our results show that ScaleDet achieves strong performance,
with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on
OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the
same backbone.
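
To make the idea of a unified semantic label space more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the unified space is simply the concatenation of every dataset's labels and that label semantic similarity is the cosine similarity between text embeddings. The label lists, the hand-made embedding vectors, and all function names are hypothetical; a real system would obtain the embeddings from a pretrained text encoder.

```python
import math

# Hypothetical label lists; the real dataset vocabularies are far larger.
DATASET_LABELS = {
    "coco": ["person", "car"],
    "lvis": ["person", "automobile", "zebra"],
}

# Hand-made stand-ins for text embeddings of each label name (assumption:
# a real system would take these from a pretrained text encoder).
TOY_TEXT_EMBEDDINGS = {
    "person": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.1],
    "automobile": [0.0, 0.9, 0.2],
    "zebra": [0.0, 0.1, 1.0],
}


def build_unified_label_space(dataset_labels):
    """Concatenate all labels, tagging each with its dataset of origin."""
    return [(ds, name) for ds, names in dataset_labels.items() for name in names]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def label_similarity_matrix(unified, embeddings):
    """Pairwise cosine similarity between all labels in the unified space."""
    vecs = [embeddings[name] for _, name in unified]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def soft_targets(label_index, sim):
    """Similarity row for one label, clipped to [0, 1], used as a soft target."""
    return [min(1.0, max(0.0, s)) for s in sim[label_index]]


unified = build_unified_label_space(DATASET_LABELS)
sim = label_similarity_matrix(unified, TOY_TEXT_EMBEDDINGS)
car_targets = soft_targets(unified.index(("coco", "car")), sim)

for (ds, name), target in zip(unified, car_targets):
    print(f"{ds}/{name}: {target:.2f}")
```

In this toy setup, a box annotated as "car" in one dataset receives a high soft target for "automobile" in another and a low one for "zebra", which is one plausible way related labels could be linked without manual relabelling.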

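This paper proposes a scalable multi-dataset detector (ScaleDet) that uses label semantic similarities and visual-textual alignment training to learn the label assignment, achieving strong performance across multiple datasets and surpassing state-of-the-art detectors with the same backbone.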

ScaleDet: A Scalable Multi-Dataset Object Detector

In computer vision, multi-label classification, including zero-shot
multi-label classification, is an important task with many real-world
applications. In this paper, we propose a novel algorithm, Aligned Dual
moDality ClaSsifier (ADDS), which includes a Dual-Modal decoder (DM-decoder)
with alignment between visual and textual features, for multi-label
classification tasks. Moreover, we design a simple yet effective method
called Pyramid-Forwarding to improve performance on high-resolution
inputs. Extensive experiments conducted on standard multi-label benchmark
datasets, MS-COCO and NUS-WIDE, demonstrate that our approach significantly
outperforms previous methods and provides state-of-the-art performance for
conventional multi-label classification, zero-shot multi-label classification,
and an extreme case called single-to-multi label classification where models
trained on single-label datasets (ImageNet-1k, ImageNet-21k) are tested on
multi-label ones (MS-COCO and NUS-WIDE). We also analyze how visual-textual
alignment contributes to the proposed approach, validate the significance of
the DM-decoder, and demonstrate the effectiveness of Pyramid-Forwarding on
vision transformers.
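
As a rough illustration of how visual-textual alignment supports both conventional and zero-shot multi-label classification, here is a minimal sketch under stated assumptions. It is not the DM-decoder: it scores each label independently by the cosine similarity between one image feature and that label's text embedding, then applies a sigmoid and a threshold. The feature vectors, the `offset` and `temperature` parameters, and the function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def multilabel_scores(image_feature, label_embeddings, offset=0.5, temperature=0.1):
    """Score every label independently with a sigmoid over shifted similarity."""
    scores = {}
    for name, text_vec in label_embeddings.items():
        logit = (cosine(image_feature, text_vec) - offset) / temperature
        scores[name] = 1.0 / (1.0 + math.exp(-logit))
    return scores


def predict(image_feature, label_embeddings, threshold=0.5):
    """Return all labels whose score reaches the threshold (possibly several)."""
    scores = multilabel_scores(image_feature, label_embeddings)
    return [name for name, score in scores.items() if score >= threshold]


# Toy stand-ins for an image feature and for text embeddings of label names.
image_feature = [0.8, 0.6, 0.0]
seen_labels = {
    "dog": [1.0, 0.0, 0.0],
    "ball": [0.0, 1.0, 0.0],
    "boat": [0.0, 0.0, 1.0],
}

print(predict(image_feature, seen_labels))

# Zero-shot use: a label never seen in training only needs a text embedding.
with_unseen = dict(seen_labels, puppy=[0.9, 0.1, 0.0])
print(predict(image_feature, with_unseen))
```

Because each label is scored on its own rather than through a softmax over all labels, an image can receive several labels at once, and adding an unseen label requires no change to the scoring function.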

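This paper proposes a new algorithm, Aligned Dual moDality ClaSsifier (ADDS), which includes a Dual-Modal decoder (DM-decoder) with alignment between visual and textual features for multi-label classification tasks, and designs a method called Pyramid-Forwarding to improve performance on high-resolution inputs. Extensive experiments on benchmark datasets such as MS-COCO and NUS-WIDE show that the approach significantly outperforms previous methods and achieves state-of-the-art performance on conventional multi-label classification, zero-shot multi-label classification, and single-to-multi label classification.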