In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks. To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads, such as the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling. In addition, SeqTR shares the same optimization objective across all tasks, a simple cross-entropy loss, which removes the need to design and tune hand-crafted loss functions. Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms (or is on par with) existing state-of-the-art methods, showing that a simple yet universal approach to visual grounding is indeed feasible.
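
To make the "sequence of discrete coordinate tokens" idea concrete, the following is a minimal sketch of how a bounding box might be quantized into tokens, decoded back, and scored with a per-token cross-entropy loss. It is an illustration rather than the authors' implementation: the bin count `NUM_BINS`, the bin-center decoding rule, and the function names `box_to_tokens`, `tokens_to_box`, and `sequence_cross_entropy` are all assumptions introduced here.

```python
import math
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumed vocabulary size for coordinate tokens (hypothetical)


def box_to_tokens(box: Tuple[float, float, float, float],
                  width: float, height: float,
                  num_bins: int = NUM_BINS) -> List[int]:
    """Quantize an (x1, y1, x2, y2) box in pixels into four integer tokens."""
    x1, y1, x2, y2 = box
    # Normalize each coordinate to [0, 1] by the image size.
    normalized = (x1 / width, y1 / height, x2 / width, y2 / height)
    # Map to a bin index and clamp to the valid range [0, num_bins - 1].
    return [min(num_bins - 1, max(0, int(v * num_bins))) for v in normalized]


def tokens_to_box(tokens: Sequence[int],
                  width: float, height: float,
                  num_bins: int = NUM_BINS) -> Tuple[float, float, float, float]:
    """Decode four tokens back to pixel coordinates using bin centers (assumed rule)."""
    x1, y1, x2, y2 = [(t + 0.5) / num_bins for t in tokens]
    return (x1 * width, y1 * height, x2 * width, y2 * height)


def sequence_cross_entropy(logits: Sequence[Sequence[float]],
                           targets: Sequence[int]) -> float:
    """Mean cross-entropy over a token sequence, one logit vector per step."""
    total = 0.0
    for step_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in step_logits))
        total += log_z - step_logits[target]
    return total / len(targets)


if __name__ == "__main__":
    tokens = box_to_tokens((160.0, 120.0, 480.0, 360.0), width=640, height=480)
    print(tokens)
    print(tokens_to_box(tokens, width=640, height=480))
```

Under this sketch, a segmentation mask could be handled the same way by sampling points along its contour and quantizing each point into a pair of tokens, so the same decoder and the same loss would serve both box and mask prediction.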

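This paper proposes SeqTR, a simple and universal network for visual grounding tasks, including referring expression comprehension. By treating visual grounding as a point prediction problem conditioned on image and text inputs, SeqTR unifies these tasks without task-specific branches or heads. Its simple cross-entropy loss further reduces the complexity associated with hand-crafted loss functions, and experiments on five benchmark datasets demonstrate the feasibility and superiority of SeqTR.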

SeqTR: A Simple yet Universal Network for Visual Grounding