Text-to-image generation has been studied extensively. Although existing
models perform well on this task, significant challenges arise when they are
applied directly to generate
images in dialogs. In this paper, we first highlight a n