Generating longer textual sequences conditioned on visual information is an interesting problem to explore. It is more challenging than standard vision-conditioned sentence-level generation (e.g., image or video captioning) because it requires producing a brief and coherent story that describes the visual content. In this paper, we cast this Vision-to-Sequence task as a Graph-to-Sequence learning problem and approach it with the Transformer architecture. Specifically, we introduce the Sparse Graph-to-Sequence Transformer (SGST) for encoding the graph and decoding a sequence. The encoder aims to directly encode graph-level semantics, while the decoder is used to generate longer sequences. Experiments conducted on the benchmark image paragraph dataset show that our proposed approach achieves a 13.3% improvement on the CIDEr evaluation measure compared to the previous state-of-the-art approach.
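
The abstract does not say how sparsity enters the encoder. One common way to restrict Transformer attention to a graph is to let each node attend only to its neighbours, and the sketch below illustrates that idea in plain Python. It is an assumption made for illustration, not the SGST implementation: the function name `graph_masked_attention`, the single-head formulation, and the toy adjacency matrix are all hypothetical.

```python
import math


def graph_masked_attention(queries, keys, values, adjacency):
    """Single-head scaled dot-product attention restricted to graph edges.

    Illustrative only (not the SGST encoder). adjacency[i][j] is truthy
    when node i is allowed to attend to node j.
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        # Only neighbours of node i are scored; all other nodes are skipped.
        neighbours = [j for j, edge in enumerate(adjacency[i]) if edge]
        if not neighbours:
            raise ValueError(f"node {i} has no neighbours (add a self-loop)")
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in neighbours
        ]
        # Numerically stable softmax over the neighbour scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the neighbours' value vectors.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, neighbours))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy graph with three nodes; node 0 and node 2 are not connected.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
encoded = graph_masked_attention(nodes, nodes, nodes, adjacency)
```

In this toy graph, the encoding of node 0 mixes only nodes 0 and 1, so information from node 2 can reach it only through further attention layers.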

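This paper addresses the problem of generating long text sequences conditioned on the visual information in an image and proposes the SGST model, which uses the Transformer architecture to generate natural-language paragraph sequences from images and can directly encode graph-level semantics. On the image paragraph dataset, it improves the CIDEr evaluation measure by 13.3% relative to the previous state of the art.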

Vision-guided sparse graph-to-sequence learning for generating long text sequences