Unlike Object Detection, Visual Grounding detects a single bounding box for each text-image pair. This single box per text-image pair provides only a sparse supervision signal. Although previous works achieve impressive results, their passive utilization of the annotation, i.e. the sole use of the box annotation as regression ground truth, results in suboptimal performance. In this paper, we present SegVG, a novel method that transfers the box-level annotation into segmentation signals to provide additional pixel-level supervision for Visual Grounding. Specifically, we propose the Multi-layer Multi-task Encoder-Decoder as the target grounding stage, where we learn a regression query and multiple segmentation queries to ground the target by regression and segmentation of the box in each decoding layer, respectively. This approach allows us to iteratively exploit the annotation as a signal for both box-level regression and pixel-level segmentation. Moreover, because the backbones are typically initialized with pretrained parameters learned from unimodal tasks, and the queries for both regression and segmentation are static learnable embeddings, a domain discrepancy remains among these three types of features, which impairs subsequent target grounding. To mitigate this discrepancy, we introduce the Triple Alignment module, where the query, text, and vision tokens are triangularly updated to share the same space by a triple attention mechanism. Extensive experiments on five widely used datasets validate our state-of-the-art (SOTA) performance.
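
To make the idea of reusing a box annotation as a pixel-level signal concrete, the following is a minimal sketch, not the implementation used in SegVG. It assumes a box given as `(x1, y1, x2, y2)` in pixel coordinates and a plain binary cross-entropy loss; the function names `box_to_mask` and `bce_loss` are hypothetical, and the abstract does not specify the actual segmentation loss.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Rasterise a box (x1, y1, x2, y2) into a binary mask of shape (height, width).

    Assumption: pixel coordinates with an exclusive right/bottom edge.
    """
    x1, y1, x2, y2 = box
    # Round outward and clamp to the image so the mask never indexes out of range.
    x1 = int(np.clip(np.floor(x1), 0, width))
    y1 = int(np.clip(np.floor(y1), 0, height))
    x2 = int(np.clip(np.ceil(x2), 0, width))
    y2 = int(np.clip(np.ceil(y2), 0, height))
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y1:y2, x1:x2] = 1.0  # pixels inside the box are foreground
    return mask


def bce_loss(logits, target, eps=1e-7):
    """Mean binary cross-entropy between per-pixel logits and a binary mask."""
    logits = np.asarray(logits, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    return float(-(target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs)).mean())
```

In such a setup, the same box would supervise a regression head through its coordinates and a segmentation head through the mask, so every pixel contributes a training signal instead of only four numbers.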

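SegVG is a novel method that converts box-level annotations into segmentation signals, providing pixel-level supervision for the Visual Grounding task. With the Multi-layer Multi-task Encoder-Decoder, we learn a regression query and multiple segmentation queries to ground the target through regression and segmentation in each decoding layer. The Triple Alignment module reduces the domain discrepancy by using a triple attention mechanism to update the query, text, and vision features, which improves target grounding performance. Extensive experiments on five widely used datasets confirm our superior performance.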
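
One way to picture a joint update of query, text, and vision tokens is sketched below. The text above does not specify the internals of the Triple Alignment module, so this sketch assumes a single-head self-attention over the concatenation of the three token sets with a residual connection. The function `triple_attention` and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical names introduced only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def triple_attention(query_tokens, text_tokens, vision_tokens, w_q, w_k, w_v):
    """Jointly update three token sets with one self-attention pass.

    Assumption: each input has shape (num_tokens, dim) and the three sets are
    simply concatenated, so every token can attend to all three modalities.
    """
    tokens = np.concatenate([query_tokens, text_tokens, vision_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) attention weights
    updated = tokens + attn @ v                     # residual update
    n_q, n_t = len(query_tokens), len(text_tokens)
    # Split the updated sequence back into its three parts.
    return updated[:n_q], updated[n_q:n_q + n_t], updated[n_q + n_t:], attn
```

Because all three token sets pass through the same projections and attend to one another, their updated features are expressed in a shared space, which is the effect the module is described as aiming for.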

SegVG: Transferring Object Bounding Boxes to Segmentation for Visual Grounding