Text-to-image multimodal tasks, which generate or retrieve an image from a given
text description, are extremely challenging because raw text descriptions
carry too little information to fully describe visually realistic
images. We propose a new visual contextual text representation for
text-to-image multimodal tasks, VICTR, which captures rich visual semantic
information of objects from the text input. First, we use the text description
as initial input and conduct dependency parsing to extract the syntactic
structure and analyse the semantic aspect, including object quantities, to
extract the scene graph. Then, we train Graph Convolutional Networks on the
extracted objects, attributes, and relations in the scene graph, together with
the corresponding geometric relation information, to generate a text
representation that integrates textual and visual semantic information. The
text representation is aggregated with word-level and sentence-level embeddings
to generate both visual contextual word and sentence representations. For the
evaluation, we attach VICTR to state-of-the-art models in text-to-image
generation. VICTR is easily added to existing models and improves them in both
quantitative and qualitative aspects.

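This paper proposes VICTR, a new visual contextual text representation for text-to-image multimodal tasks. By combining Graph Convolutional Networks with text representations, it effectively captures the visual feature information in text semantics and yields both quantitative and qualitative improvements in experiments.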
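
To make the graph step concrete, the following is a minimal sketch of a single graph convolution over a toy scene graph, followed by one possible way of combining the result with word embeddings. It is not the VICTR implementation: the toy graph, the feature sizes, the random weights, and the use of concatenation as the aggregation step are all assumptions made for illustration, and the function name gcn_layer is hypothetical.

```python
import numpy as np


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1)                  # node degrees (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalisation of the adjacency matrix.
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)


# Hypothetical scene graph for "sheep standing on grass":
# object -- relation -- object, stored as an undirected adjacency matrix.
nodes = ["sheep", "standing on", "grass"]
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))    # assumed initial node features
weight = rng.normal(size=(8, 4))   # untrained layer weights, for shape only

node_repr = gcn_layer(adj, feats, weight)   # per-node graph representation
graph_repr = node_repr.mean(axis=0)         # pooled, sentence-level summary

# Assumed aggregation: concatenate with stand-in word embeddings.
word_emb = rng.normal(size=(3, 4))
visual_word = np.concatenate([word_emb, node_repr], axis=1)
```

In this sketch each node's output mixes its own features with those of its neighbours, so the relation node "standing on" receives information from both objects. A trained model would learn the weights and would likely stack several such layers.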