Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationship predicted, while region captioning gives a language description of the objects, their attributes, relations, and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner. Objects, phrases, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on three tasks, and show the joint learning across three tasks with our proposed method can bring mutual improvements over previous models. Particularly, on the scene graph generation task, our proposed method outperforms the state-of-art method with more than 3% margin.

本文提出了一种新颖的神经网络模型，名为多级场景描述网络(MSDN)，通过动态图对对象、短语和描述区域进行对齐，并使用特征细化结构在三个语义任务的三个级别之间传递消息，从而以端到端方式共同解决三个视觉任务，包括目标检测、场景图生成和区域字幕。经过实验验证，该方法可以取得较好的效果。

从物体、短语和区域说明生成场景图