We present X-MDPT (Cross-view Masked Diffusion Prediction Transformers), a novel diffusion model designed for pose-guided human image generation. X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches, a departure from the U-Net architectures commonly used in existing work. The model comprises three key modules: 1) a denoising diffusion Transformer, 2) an aggregation network that consolidates the conditions into a single vector for the diffusion process, and 3) a mask cross-prediction module that enhances representation learning with semantic information from the reference image. X-MDPT is scalable: FID, SSIM, and LPIPS all improve as the model grows. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while remaining efficient in terms of trainable parameters, training time, and inference speed. Our compact 33 MB model achieves an FID of 7.42, surpassing a prior U-Net latent diffusion approach (FID 8.07) while using $11\times$ fewer parameters. Our best model surpasses the pixel-based diffusion approach with only $\frac{2}{3}$ of its parameters and achieves $5.43\times$ faster inference.
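To make the phrases "operate on latent patches" and "consolidates the conditions into a single vector" more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The latent shape, patch size, mask ratio, mean pooling, and the helper names `patchify`, `random_mask`, and `aggregate_conditions` are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np

def patchify(latent, patch):
    """Split a (C, H, W) latent into (N, patch*patch*C) patch tokens."""
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = latent.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 2, 4, 0)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(tokens, mask_ratio, rng):
    """Drop a random subset of tokens; mask[i] is True where token i was dropped."""
    n = tokens.shape[0]
    n_keep = n - int(round(n * mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask

def aggregate_conditions(pose_tokens, ref_tokens, w_proj):
    """Hypothetical aggregation: mean-pool each condition, concatenate, project."""
    pooled = np.concatenate([pose_tokens.mean(axis=0), ref_tokens.mean(axis=0)])
    return pooled @ w_proj  # a single conditioning vector

rng = np.random.default_rng(0)

# Assumed shapes: a 4-channel 32x32 latent and 2x2 patches.
latent = rng.standard_normal((4, 32, 32))
tokens = patchify(latent, patch=2)                      # (256, 16)
visible, mask = random_mask(tokens, mask_ratio=0.3, rng=rng)

# Stand-ins for the pose and reference-image conditions.
pose_tokens = patchify(rng.standard_normal((4, 32, 32)), patch=2)
ref_tokens = patchify(rng.standard_normal((4, 32, 32)), patch=2)
w_proj = rng.standard_normal((32, 64))                  # assumed projection weights
cond = aggregate_conditions(pose_tokens, ref_tokens, w_proj)
```

In this sketch the denoising transformer would receive only `visible` together with `cond`, and a prediction head would be trained to recover the tokens flagged in `mask`. How X-MDPT actually wires these pieces together is not described in the abstract.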

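X-MDPT is a novel diffusion model for pose-guided human image generation. It employs masked diffusion transformers that operate on latent patches, unlike the U-Net structures commonly used in existing work. The model comprises three key modules: a denoising diffusion transformer, an aggregation network that consolidates the conditions into a single vector for the diffusion process, and a mask cross-prediction module that enhances representation learning with semantic information from the reference image. X-MDPT scales with model size, outperforms existing methods on the DeepFashion dataset, and is efficient in terms of trainable parameters, training time, and inference speed. Our compact 33 MB model achieves an FID of 7.42, surpassing a prior U-Net latent diffusion approach (FID 8.07) while using 11 times fewer parameters. Our best model uses 2/3 of the parameters of the pixel-based diffusion approach and achieves 5.43 times faster inference.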

Cross-view Masked Diffusion Transformers for Person Image Synthesis