Can a Transformer perform $2\mathrm{D}$ object-level recognition from a pure sequence-to-sequence perspective with minimal knowledge of the $2\mathrm{D}$ spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the na\"ive Vision Transformer with the fewest possible modifications and inductive biases. We find that YOLOS pre-trained only on the mid-sized ImageNet-$1k$ dataset can already achieve competitive object detection performance on COCO, \textit{e.g.}, YOLOS-Base, directly adopted from BERT-Base, achieves $42.0$ box AP. We also discuss the impacts and limitations of current pre-training schemes and model scaling strategies for Transformers in vision through object detection. Code and model weights are available at \url{https://github.com/hustvl/YOLOS}.

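This paper uses the YOLOS model series to examine how well a Transformer performs on 2D object- and region-level recognition, and finds that YOLOS models pre-trained on the mid-sized ImageNet-1k dataset can already achieve quite competitive performance on the COCO object detection benchmark. The authors also discuss the impacts and limitations of current pre-training schemes and model scaling strategies, as well as those of YOLOS itself.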

You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection