transformers have recently been shown to generate high quality images from texts. However, existing methods struggle to create high fidelity full-body images, especially multiple people. A person's pose has a high degree of freedom that is difficult to describe using words only; this c