Recent advances in tuning-free personalized image generation based on
diffusion models are impressive. However, to improve subject fidelity, existing
methods either retrain the diffusion model or infuse it with dense visual
embeddings, both of which suffer from poor generalization and low efficiency. Moreover,
these methods falter in multi-subject image generation due to the unconstrained
cross-attention mechanism. In this paper, we propose MM-Diff, a unified and
tuning-free image personalization framework capable of generating high-fidelity
images of both single and multiple subjects in seconds. Specifically, to
simultaneously enhance text consistency and subject fidelity, MM-Diff employs a
vision encoder to transform the input image into CLS and patch embeddings. The
CLS embeddings serve two purposes: they augment the text embeddings and,
together with the patch embeddings, are used to derive a small number of
detail-rich subject embeddings. Both the augmented text embeddings and the
subject embeddings are efficiently integrated into the diffusion model through
a well-designed multimodal cross-attention mechanism. Additionally, MM-Diff
introduces cross-attention map constraints
during the training phase, ensuring flexible multi-subject image sampling
during inference without any predefined inputs (e.g., layout). Extensive
experiments demonstrate the superior performance of MM-Diff over other leading
methods.
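
To make the two mechanisms named above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the exact form of the multimodal cross-attention or of the attention map constraint, so everything here is an assumption for illustration: separate text and subject attention branches that are summed, a penalty on attention mass that a latent position inside one subject's region assigns to another subject's tokens, and all names (multimodal_cross_attention, attention_map_constraint, scale, masks, tokens_per_subject). The vision encoder and the derivation of subject embeddings are replaced by random arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, ctx, w_k, w_v):
    """Single-head scaled dot-product cross-attention of queries q over ctx."""
    k, v = ctx @ w_k, ctx @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_latent, n_ctx)
    return attn @ v, attn

def multimodal_cross_attention(q, text_emb, subj_emb, w, scale=1.0):
    # ASSUMPTION: text and subject embeddings are attended to in separate
    # branches and the results are summed; the paper may combine them differently.
    text_out, _ = attend(q, text_emb, w["k_text"], w["v_text"])
    subj_out, subj_attn = attend(q, subj_emb, w["k_subj"], w["v_subj"])
    return text_out + scale * subj_out, subj_attn

def attention_map_constraint(subj_attn, masks, tokens_per_subject):
    # ASSUMPTION: hypothetical training-time penalty. For each latent position
    # inside subject s's mask, penalize the attention mass NOT given to
    # subject s's own tokens. masks has shape (n_subjects, n_latent).
    n_latent, total = subj_attn.shape
    n_subjects = total // tokens_per_subject
    mass = subj_attn.reshape(n_latent, n_subjects, tokens_per_subject).sum(axis=-1)
    wrong = masks.T * (1.0 - mass)  # (n_latent, n_subjects)
    return wrong.sum() / max(masks.sum(), 1.0)

# Toy setup: 8 latent positions, 5 text tokens, 2 subjects with 4 tokens each.
rng = np.random.default_rng(0)
d, n_latent, n_text, n_subj, t = 16, 8, 5, 2, 4
q = rng.normal(size=(n_latent, d))           # stand-in for latent queries
text_emb = rng.normal(size=(n_text, d))      # stand-in for augmented text embeddings
subj_emb = rng.normal(size=(n_subj * t, d))  # stand-in for subject embeddings
w = {name: rng.normal(size=(d, d)) / np.sqrt(d)
     for name in ("k_text", "v_text", "k_subj", "v_subj")}

# Subject 0 occupies the first four latent positions, subject 1 the rest.
masks = np.zeros((n_subj, n_latent))
masks[0, :4] = 1.0
masks[1, 4:] = 1.0

out, subj_attn = multimodal_cross_attention(q, text_emb, subj_emb, w)
loss = attention_map_constraint(subj_attn, masks, t)
print(out.shape, float(loss))
```

Under these assumptions the penalty is zero when every position inside a subject's region attends only to that subject's tokens, which is the kind of separation that would let multiple subjects be sampled at inference without a predefined layout.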

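To improve subject fidelity, we propose MM-Diff, a unified and tuning-free personalized image generation framework that can generate high-fidelity images of both single and multiple subjects within seconds. MM-Diff uses a vision encoder to transform the input image into CLS and patch embeddings. Through a well-designed multimodal cross-attention mechanism, the CLS embeddings are used both to augment the text embeddings and, together with the patch embeddings, to derive a small number of detail-rich subject embeddings. Cross-attention map constraints introduced during training ensure flexible multi-subject image sampling at inference. Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods.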