Most text-to-image customization techniques fine-tune a model on a small set of \emph{personal concept} images captured in minimal contexts. As a result, the model often overfits to these training images and fails to generalize to the new contexts described in later text prompts. Existing customization methods build on the success of representing personal concepts effectively as textual embeddings. In this work, we therefore diversify the context of these personal concepts \emph{solely} within the textual space: we construct a contextually rich set of text prompts and pair it with a widely used self-supervised learning objective. Surprisingly, this simple and cost-effective method significantly improves semantic alignment in the textual space, and the effect carries over to the image space, yielding higher prompt fidelity in generated images. Because our approach requires no architectural modifications, it is highly compatible with existing text-to-image customization methods. We demonstrate its broad applicability by combining it with four different baseline methods, achieving notable CLIP score improvements.
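
The idea of a contextually rich prompt set can be illustrated with a minimal sketch. The abstract does not specify how the prompts are constructed, so everything below is an assumption made for illustration: the placeholder token `<new1>`, the helper `build_prompts`, and the small style, relation, and background vocabularies are all hypothetical. The sketch only shows one plausible way to place a single personal-concept token into many textual contexts without using any additional images.

```python
from itertools import product

# Hypothetical placeholder token standing in for the personal concept.
PLACEHOLDER = "<new1>"

# Hypothetical context vocabularies; a real prompt set would be far larger.
STYLES = ["a photo of", "a watercolor painting of", "a pencil sketch of"]
RELATIONS = ["", "next to a red bicycle", "under a street lamp"]
BACKGROUNDS = ["on a beach", "in a snowy forest", "inside a kitchen"]


def build_prompts(placeholder, class_noun):
    """Combine the concept token with every style/relation/background triple."""
    prompts = []
    for style, relation, background in product(STYLES, RELATIONS, BACKGROUNDS):
        parts = [style, placeholder, class_noun, relation, background]
        # Skip empty fragments so the prompt has no doubled spaces.
        prompts.append(" ".join(p for p in parts if p))
    return prompts


prompts = build_prompts(PLACEHOLDER, "dog")
for prompt in prompts[:3]:
    print(prompt)
```

Each resulting prompt keeps the same concept token while varying its surroundings, which is the kind of text-only context diversity the abstract describes. How such prompts feed into the self-supervised objective is not stated in the abstract and is not sketched here.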

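This work addresses the limited generalization of existing text-to-image customization techniques, which stems from overly narrow training data. The authors propose diversifying the context of personal concepts by constructing a rich set of text prompts, which significantly improves semantic alignment and raises the fidelity of generated images. The method requires no architectural modifications, is compatible with existing customization techniques, and is highly extensible.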

Learning to Customize Text-to-Image Diffusion Models in Diverse Contexts