The recent progress in diffusion-based text-to-image generation models has significantly expanded generative capabilities via conditioning the text descriptions. However, since relying solely on text prompts is still restrictive for fine-grained customization, we aim to extend the boundaries of conditional generation to incorporate diverse types of modalities, e.g., sketch, box, and style embedding, simultaneously. We thus design a multimodal text-to-image diffusion model, coined as DiffBlender, that achieves the aforementioned goal in a single model by training only a few small hypernetworks. DiffBlender facilitates a convenient scaling of input modalities, without altering the parameters of an existing large-scale generative model to retain its well-established knowledge. Furthermore, our study sets new standards for multimodal generation by conducting quantitative and qualitative comparisons with existing approaches. By diversifying the channels of conditioning modalities, DiffBlender faithfully reflects the provided information or, in its absence, creates imaginative generation.

通过设计一种多模态文本到图像扩散模型（DiffBlender），可以同时引入多种不同类型的细节表达方式，如草图、盒子和风格嵌入等，不需要更改现有模型的参数，从而在单个模型中实现条件生成，并且通过量化和定性比较，将多模态生成的标准提高到了新的水平。

DiffBlender: 可扩展和可组合的多模态文本到图像扩散模型