Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. Our code will be available at https://github.com/wtybest/EnMMDiT.

本研究解决了MMDiT模型在多个具有相似语义或外观的主题输入时的生成问题，包括主题忽视和混合现象。我们提出了一种动态修复模糊潜在表示的方法，通过设计三种损失函数来应对这些模糊点，并进而提出在线重叠检测和回到起始采样策略，以提高生成质量和成功率。实验结果表明，该方法在相似主题生成任务中显著优于现有技术。

提升基于MMDiT的文本到图像模型以生成相似主题