We present YORO, a multi-modal, encoder-only transformer architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design without a CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens, and predicts the coordinates of the referred object using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on five different datasets, with ablations on architecture design choices. YORO is shown to support real-time inference and to outperform all approaches in its class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature.
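
To make the input described above concrete, the following is a minimal sketch of how a single encoder sequence might be laid out from text tokens, image patches, and detection tokens. It is an illustration, not the authors' implementation: the function name `build_layout`, the concatenation order (text, then patches, then detection tokens), and every size used below are assumptions chosen for the example, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceLayout:
    """Index ranges of each token group inside one encoder input sequence."""
    text: range
    patches: range
    det: range

    @property
    def total_length(self) -> int:
        return len(self.text) + len(self.patches) + len(self.det)


def build_layout(num_text_tokens, image_hw, patch_size, num_det_tokens):
    """Hypothetical helper: lay out [text | patches | detection tokens].

    The ordering and all sizes are assumptions for illustration only.
    """
    height, width = image_hw
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    # Non-overlapping square patches, as in a ViT-style patch embedding.
    num_patches = (height // patch_size) * (width // patch_size)

    text_end = num_text_tokens
    patch_end = text_end + num_patches
    det_end = patch_end + num_det_tokens
    return SequenceLayout(
        text=range(0, text_end),
        patches=range(text_end, patch_end),
        det=range(patch_end, det_end),
    )


if __name__ == "__main__":
    # Assumed sizes: 40 text tokens, a 384x384 image, 32x32 patches, 5 detection tokens.
    layout = build_layout(40, (384, 384), 32, 5)
    print(len(layout.patches), layout.total_length)
```

With these assumed sizes, the image contributes 12 x 12 = 144 patch tokens and the full sequence has 189 positions. In a design of this kind, the encoder outputs at the detection-token positions would be the ones passed to a box-prediction head, while the text and patch positions are the natural inputs to an alignment loss.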

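This paper introduces YORO, a multi-modal Transformer encoder architecture for the visual grounding task. It adopts a single-stage design without a CNN backbone, takes natural language queries, image patches, and learnable detection tokens as input to predict the coordinates of the referred object, and proposes a new patch-text alignment loss. Extensive experiments across several datasets show that the method strikes a better balance between speed and accuracy, supports real-time inference, outperforms all existing methods in its class (single-stage methods), and achieves the best speed/accuracy trade-off.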

YORO -- Lightweight End-to-End Visual Grounding