Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.
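
To make the correspondence idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that rasterizing the mesh has already produced one 3D surface point per query pixel, that every view has a pinhole camera `(K, R, t)`, and that features are sampled by nearest-neighbour lookup. The function names `project`, `gather`, and `correspondence_attention` are hypothetical, and occlusion handling is omitted. Each query then attends only to its V projected locations instead of to every pixel in every view, which is where the reduction in attention cost would come from.

```python
import numpy as np

def project(points, K, R, t):
    """Project (N, 3) world points to (N, 2) pixel coords with a pinhole camera."""
    cam = points @ R.T + t           # world frame -> camera frame
    uv = cam @ K.T                   # camera frame -> homogeneous pixel coords
    return uv[:, :2] / uv[:, 2:3]    # perspective divide

def gather(feat, uv):
    """Nearest-neighbour lookup of an (H, W, C) feature map at (N, 2) pixel coords."""
    h, w, _ = feat.shape
    x = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    return feat[y, x]

def correspondence_attention(query, points, feats, cams):
    """Attend from each query token to its projected location in every view.

    query:  (N, C) query features of one view
    points: (N, 3) surface points under the query pixels (assumed to come
            from rasterizing the mesh; not computed here)
    feats:  list of V feature maps, each (H, W, C)
    cams:   list of V tuples (K, R, t)
    Note: visibility and occlusion are ignored in this sketch.
    """
    keys = np.stack(
        [gather(f, project(points, *c)) for f, c in zip(feats, cams)], axis=1
    )                                                    # (N, V, C)
    scores = np.einsum("nc,nvc->nv", query, keys) / np.sqrt(query.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("nv,nvc->nc", weights, keys)        # (N, C)
```

Under these assumptions, each of the N query tokens compares against V keys, so the cost grows as N·V rather than with the square of the total token count across views.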

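This work addresses the challenges that existing multiview diffusion models face when generating high-resolution human images, in particular their poor results when scaled to the megapixel level. By introducing a mesh attention mechanism, the method enables efficient training at 1024x1024 resolution, substantially reducing the complexity of multiview attention while preserving cross-view consistency. Experiments show that MEAT outperforms existing methods at generating dense, consistent multiview human images.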

MEAT: A Multiview Diffusion Model for Human Generation with Mesh Attention