Typical methods for text-to-image synthesis seek to design an effective
generative architecture that models the text-to-image mapping directly. This is
fairly arduous due to the cross-modality translation. In this paper, we
circumvent this problem by focusing on parsing the content of both the