In this paper, we propose a transformer based approach for visual grounding.
Unlike previous proposal-and-rank frameworks that rely heavily on pretrained
object detectors or proposal-free frameworks that upgrade an off-the-shelf
one-stage detector by fusing textual embeddings, our approach is built on top
of a transformer encoder-decoder and is independent of any pretrained detectors
or word embedding models. Termed VGTR -- Visual Grounding with TRansformers,
our approach is designed to learn semantic-discriminative visual features under
the guidance of the textual description without harming their location ability.
This information flow enables our VGTR to have a strong capability in capturing
context-level semantics of both vision and language modalities, rendering us to
aggregate accurate visual clues implied by the description to locate the
interested object instance. Experiments show that our method outperforms
state-of-the-art proposal-free approaches by a considerable margin on five
benchmarks while maintaining fast inference speed.

该论文提出了一种基于 Transformer 编码器 - 解码器的视觉 grounding 方法，通过在不损伤位置定位能力的前提下，在文本描述的指导下学习语义鉴别的视觉特征，具有强大的文本 - 视觉语境语义捕捉能力。实验结果表明，在保持快速推理速度的同时，该方法在五个基准上优于现有的提案 - free 方法。