Recent large-scale text-driven synthesis models have attracted much attention
thanks to their remarkable capabilities of generating highly diverse images
that follow given text prompts. Such text-based synthesis methods are
particularly appealing to humans who are used to verbally desc