Image-to-text tasks, such as open-ended image captioning and controllable
image description, have received extensive attention for decades. Here, we
further advance this line of work by presenting Visual Spatial Description
(VSD), a new image-to-text task that targets spatial semantics. Given an
image and two objects inside it, VSD aims to produce a description that focuses
on the spatial relationship between the two objects. Accordingly, we manually
annotate a dataset to facilitate the investigation of the newly introduced task
and build several benchmark encoder-decoder models using VL-BART and VL-T5
as backbones. In addition, we investigate pipeline and joint end-to-end
architectures for incorporating visual spatial relationship classification
(VSRC) information into our model. Finally, we conduct experiments on our
benchmark dataset to evaluate all our models. Results show that our models
produce accurate and human-like spatially oriented text
descriptions. Meanwhile, VSRC has great potential for VSD, and the joint
end-to-end architecture is the better choice for their integration. We make the
dataset and code publicly available for research purposes.

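Summary: This work proposes VSD, a new image-to-text direction that focuses on spatial semantics. Using VL-BART and VL-T5 as backbones, it builds several benchmark encoder-decoder models and evaluates them on the benchmark dataset, where the models perform impressively. VSRC also shows great potential, and the joint end-to-end architecture is the better choice.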