Visual spatial description (VSD) aims to generate texts that describe the
spatial relations of the given objects within images. Existing VSD work merely
models the 2D geometrical vision features, thus inevitably falling prey to the
problem of skewed spatial understanding of target objects. In this work, we
investigate the incorporation of 3D scene features for VSD. With an external 3D
scene extractor, we obtain the 3D objects and scene features for input images,
based on which we construct a target object-centered 3D spatial scene graph
(Go3D-S2G), such that we model the spatial semantics of target objects within
the holistic 3D scenes. Besides, we propose a scene subgraph selecting
mechanism, sampling topologically-diverse subgraphs from Go3D-S2G, where the
diverse local structure features are navigated to yield spatially-diversified
text generation. Experimental results on two VSD datasets demonstrate that our
framework outperforms the baselines significantly, especially improving on the
cases with complex visual spatial relations. Meanwhile, our method can produce
more spatially-diversified generation. Code is available at
this https URL

本文研究了如何使用三维场景特征来提高视觉空间描述（VSD）的准确度和多样性，通过构建一个基于目标对象的三维空间场景图和场景子图选择机制，从而实现更加多样空间的文本生成，实验证明这种方法在视觉空间关系复杂的情况下表现明显优于基线模型。

通过整体三维场景理解生成视觉空间描述

Generating Visual Spatial Description via Holistic 3D Scene  Understanding

Image-to-text tasks, such as open-ended image captioning and controllable
image description, have received extensive attention for decades. Here, we
further advance this line of work by presenting Visual Spatial Description
(VSD), a new perspective for image-to-text toward spatial semantics. Given an
image and two objects inside it, VSD aims to produce one description focusing
on the spatial perspective between the two objects. Accordingly, we manually
annotate a dataset to facilitate the investigation of the newly-introduced task
and build several benchmark encoder-decoder models by using VL-BART and VL-T5
as backbones. In addition, we investigate pipeline and joint end-to-end
architectures for incorporating visual spatial relationship classification
(VSRC) information into our model. Finally, we conduct experiments on our
benchmark dataset to evaluate all our models. Results show that our models are
impressive, providing accurate and human-like spatial-oriented text
descriptions. Meanwhile, VSRC has great potential for VSD, and the joint
end-to-end architecture is the better choice for their integration. We make the
dataset and codes public for research purposes.

提出了一种名为 VSD 的新的图像与文本方向，其着眼于空间语义，通过使用 VL-BART 和 VL-T5 作为支撑，构建了几个基准编码 - 解码模型，并在我们的基准测试集上进行实验，结果显示我们的模型性能令人印象深刻。同时 VSRC 将会有巨大的潜力，而联合端到端架构是更好的选择。