Synthesizing images from a given text description involves two types
of information: the content, which is explicitly described in
the text (e.g., color and composition), and the style, which is usually not
well described in the text (e.g., location, quantity, and size). However, in
previous works, it is typically treated as a