Transformers have recently been shown to generate high-quality images from text. However, existing methods struggle to create high-fidelity full-body images, especially of multiple people. A person's pose has a high degree of freedom that is difficult to describe using words alone; this leads to errors in the generated image, such as incorrect body proportions and pose. We propose a pose-guided text-to-image model that uses pose as an additional input constraint. Using the proposed Keypoint Pose Encoding (KPE) to encode human pose into a low-dimensional representation, our model can generate novel multi-person images that accurately represent the provided pose and text descriptions, with minimal errors. We demonstrate that KPE is invariant to changes in the target image domain and image resolution; we show results on the Deepfashion dataset and create a new multi-person Deepfashion dataset to demonstrate the multi-person capabilities of our approach.
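
To make the idea of a low-dimensional, resolution-independent pose representation concrete, the following is a minimal sketch and not the encoding defined in the paper. It assumes each person is given as a list of 2D keypoints in pixel coordinates, normalizes them by the image size, and quantizes them into a fixed number of bins so that the resulting token sequence does not depend on the image resolution. The function name `encode_pose`, the bin count, and the handling of missing keypoints are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# A keypoint is (x, y) in pixels, or None when it was not detected.
Keypoint = Optional[Tuple[float, float]]


def encode_pose(
    people: List[List[Keypoint]],
    width: int,
    height: int,
    num_bins: int = 32,
) -> List[int]:
    """Flatten the keypoints of all people into one sequence of integer tokens.

    Illustrative assumption, not the paper's definition: coordinates are
    normalized by the image size and quantized into `num_bins` bins, and the
    extra token `num_bins` marks a missing keypoint.
    """
    missing_token = num_bins
    tokens: List[int] = []
    for person in people:
        for keypoint in person:
            if keypoint is None:
                tokens.extend([missing_token, missing_token])
                continue
            x, y = keypoint
            # Normalize to [0, 1), quantize, and clamp to the valid bin range.
            tx = min(max(int(x / width * num_bins), 0), num_bins - 1)
            ty = min(max(int(y / height * num_bins), 0), num_bins - 1)
            tokens.extend([tx, ty])
    return tokens


# The same pose at two resolutions maps to the same token sequence.
low_res = encode_pose([[(64.0, 128.0), None]], width=256, height=256)
high_res = encode_pose([[(128.0, 256.0), None]], width=512, height=512)
```

Under these assumptions the sequence length grows only with the number of people and keypoints, not with the number of pixels, which is one way a keypoint-based encoding can stay compact as the target resolution increases.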

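This work proposes the Keypoint Pose Encoding (KPE) method, which is more efficient and faster than existing pose-conditioning methods and can generate high-quality pose-conditioned images. The method is also domain-invariant and scales easily to higher-resolution images. In addition, the work introduces the People Count Error (PCE) evaluation metric to detect errors in generated images of people.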
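
The summary names PCE but does not define it. One plausible reading, stated here as an assumption, is the fraction of generated images in which the number of detected people differs from the number of people the input pose specified. The sketch below computes that quantity from two lists of counts; the function name `people_count_error` is hypothetical, and the person detector that would produce the detected counts is assumed to exist separately.

```python
from typing import Sequence


def people_count_error(
    expected_counts: Sequence[int],
    detected_counts: Sequence[int],
) -> float:
    """Fraction of images whose detected person count differs from the expected one.

    Assumed definition for illustration; the source does not spell out PCE.
    """
    if len(expected_counts) != len(detected_counts):
        raise ValueError("both sequences must describe the same set of images")
    if not expected_counts:
        raise ValueError("at least one image is required")
    mismatches = sum(
        1 for expected, detected in zip(expected_counts, detected_counts)
        if expected != detected
    )
    return mismatches / len(expected_counts)


# Four generated images; the third shows three people where two were requested.
pce = people_count_error([1, 2, 2, 3], [1, 2, 3, 3])
```

A lower value would indicate that the generated images more often contain the intended number of people.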

Keypoint Pose Encoding for Transformer-based Image Generation