Transformers have been widely used in numerous vision problems especially for visual recognition and detection. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. In addition, we extend it to ViDT+ to support joint-task learning for object detection and instance segmentation. Specifically, we attach an efficient multi-scale feature fusion layer and utilize two more auxiliary training losses, IoU-aware loss and token labeling loss. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and its extended ViDT+ achieves 53.2AP owing to its high scalability for large models. The source code and trained models are available at https://github.com/naver-ai/vidt.

本文介绍了Vision和Detection Transformers（ViDT），ViDT 是一个有效和高效的物体检测器，它通过重新配置注意力模块来扩展 Swin Transformer 为独立的物体检测器，并采用多尺度特征和辅助技术来提高检测性能，同时还支持对象检测和实例分割的联合任务学习。该技术已在 Microsoft COCO 基准数据集上获得广泛的评估结果，是目前完全基于 Transformer 的最佳物体检测器之一。

一种可扩展、高效、有效的基于Transformer的物体检测器