We present a methodology for conditional control of human shape and pose in pretrained text-to-image diffusion models using a 3D parametric human model (SMPL). Fine-tuning these diffusion models to adhere to new conditions requires large datasets and high-quality annotations, which can be acquired more cost-effectively through synthetic data generation than from real-world data. However, the domain gap and low scene diversity of synthetic data can compromise the pretrained model's visual fidelity. We propose a domain-adaptation technique that maintains image quality by isolating synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network to adapt the generated images to the input domain. To achieve SMPL control, we fine-tune a ControlNet-based architecture on the synthetic SURREAL dataset of rendered humans and apply our domain adaptation at generation time. Experiments demonstrate that our model achieves greater shape and pose diversity than the 2D pose-based ControlNet while maintaining visual fidelity and improving stability, which demonstrates its usefulness for downstream tasks such as human animation.
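The guidance composition described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption: the function name `compose_guidance`, its four inputs, and the weights `w_text` and `w_smpl` are hypothetical. The sketch takes one plausible reading, in which the SMPL signal is the difference between the synthetically trained network's predictions with and without the SMPL condition, and that difference is added to a standard classifier-free guidance step driven by a second control network. Noise predictions are represented as flat lists of floats to keep the example dependency-free.

```python
from typing import List, Sequence


def compose_guidance(
    eps_uncond: Sequence[float],     # unconditional noise prediction
    eps_domain: Sequence[float],     # prediction from the other control network (target domain)
    eps_smpl: Sequence[float],       # synthetically trained network, SMPL condition on
    eps_smpl_null: Sequence[float],  # same network, SMPL condition dropped
    w_text: float = 7.5,             # hypothetical guidance weight for the target-domain branch
    w_smpl: float = 1.0,             # hypothetical guidance weight for the isolated SMPL signal
) -> List[float]:
    """Combine noise predictions for one denoising step (illustrative assumption only)."""
    lengths = {len(eps_uncond), len(eps_domain), len(eps_smpl), len(eps_smpl_null)}
    if len(lengths) != 1:
        raise ValueError("all noise predictions must have the same length")

    combined = []
    for u, d, s, n in zip(eps_uncond, eps_domain, eps_smpl, eps_smpl_null):
        domain_term = d - u  # standard classifier-free guidance direction
        smpl_term = s - n    # SMPL information isolated from the synthetic branch
        combined.append(u + w_text * domain_term + w_smpl * smpl_term)
    return combined


if __name__ == "__main__":
    step = compose_guidance(
        [0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [2.0, 2.0], w_text=2.0, w_smpl=0.5
    )
    print(step)
```

Under this reading, when the SMPL-conditioned and SMPL-free predictions coincide, the SMPL term vanishes and the step reduces to ordinary classifier-free guidance on the target-domain branch, which is one way to see why the synthetic branch need not dictate image appearance.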

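This work addresses the challenge of conditionally controlling human shape and pose in pretrained text-to-image diffusion models. We propose a domain-adaptation technique that isolates synthetically trained conditional information in the classifier-free guidance vector and composes it with another control network to adapt the generated images to the input domain. Experiments show that the model outperforms conventional methods in shape and pose diversity while maintaining visual fidelity, giving it significant potential for downstream applications such as human animation.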

Controlling Human Shape and Pose in Text-to-Image Diffusion Models via Domain Adaptation