Recent text-to-image diffusion-based generative models have the stunning
ability to generate highly detailed and photo-realistic images and achieve
state-of-the-art low FID scores on challenging image generation benchmarks.
However, one of the primary failure modes of these text-to-image generative
models is in composing attributes, objects, and their associated relationships
accurately into an image. In this paper, we investigate this
compositionality-based failure mode and show that imperfect text
conditioning with the CLIP text-encoder is one of the primary reasons these
models cannot generate high-fidelity compositional scenes. In
particular, we show that (i) there exists an optimal text-embedding space that
can generate highly coherent compositional scenes, which indicates that the output
space of the CLIP text-encoder is sub-optimal, and (ii) the
final token embeddings in CLIP are erroneous, as they often include attention
contributions from unrelated tokens in compositional prompts. Our main finding
is that the best compositional improvements can be achieved (without harming
the model's FID scores) by fine-tuning only a simple linear projection on
CLIP's representation space in Stable-Diffusion variants, using a small set of
compositional image-text pairs. This result demonstrates that the
sub-optimality of CLIP's output space is a major error source. We also show
that re-weighting the erroneous attention contributions in CLIP can
improve compositional performance; however, these improvements are often
less significant than those achieved by solely learning a linear projection
head, which suggests that erroneous attention is only a minor error source.
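
The idea of training only a linear projection on top of frozen text-encoder outputs can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the identity initialisation, the two-dimensional toy embeddings, and above all the mean-squared-error objective, which stands in for the diffusion training loss that an actual Stable-Diffusion fine-tune would use. It is not the authors' implementation.

```python
def identity_projection(dim):
    """Weights and bias initialised so the projection starts as a no-op."""
    weights = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    bias = [0.0] * dim
    return weights, bias


def project(weights, bias, embeddings):
    """Apply y = W x + b to every (frozen) token embedding."""
    return [
        [sum(w * x for w, x in zip(row, emb)) + b for row, b in zip(weights, bias)]
        for emb in embeddings
    ]


def mse(predicted, targets):
    """Stand-in objective; NOT the diffusion loss used to train real models."""
    count = len(predicted) * len(predicted[0])
    return sum(
        (p - t) ** 2
        for p_row, t_row in zip(predicted, targets)
        for p, t in zip(p_row, t_row)
    ) / count


def train_step(weights, bias, embeddings, targets, lr):
    """One gradient-descent step that updates only the projection."""
    outputs = project(weights, bias, embeddings)
    scale = 2.0 / (len(embeddings) * len(bias))
    for i in range(len(bias)):
        errors = [out[i] - tgt[i] for out, tgt in zip(outputs, targets)]
        for j in range(len(weights[i])):
            grad = scale * sum(e * emb[j] for e, emb in zip(errors, embeddings))
            weights[i][j] -= lr * grad
        bias[i] -= lr * scale * sum(errors)


# Toy data (hypothetical): three 2-d "encoder outputs" and the embeddings
# we pretend a better-conditioned space would assign to them.
frozen_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

weights, bias = identity_projection(2)
initial_loss = mse(project(weights, bias, frozen_embeddings), target_embeddings)
for _ in range(500):
    train_step(weights, bias, frozen_embeddings, target_embeddings, lr=0.5)
final_loss = mse(project(weights, bias, frozen_embeddings), target_embeddings)
print(initial_loss, final_loss)
```

The encoder outputs are never modified; only `weights` and `bias` change, which mirrors the claim that a small set of trainable parameters on CLIP's output space is enough to move the conditioning signal.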

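By studying this compositionality-based failure mode, we find that imperfect text conditioning from the CLIP text-encoder in text-to-image generative models is the main reason they cannot generate high-fidelity compositional scenes, and we show that learning only a simple linear projection on CLIP's representation space yields the best compositional improvements without degrading the model's FID scores.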

Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models

Deployed multimodal systems can fail in ways that evaluators did not
anticipate. In order to find these failures before deployment, we introduce
MultiMon, a system that automatically identifies systematic failures --
generalizable, natural-language descriptions of patterns of model failures. To
uncover systematic failures, MultiMon scrapes a corpus for examples of
erroneous agreement: inputs that produce the same output, but should not. It
then prompts a language model (e.g., GPT-4) to find systematic patterns of
failure and describe them in natural language. We use MultiMon to find 14
systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder,
each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many
books"). Because CLIP is the backbone for most state-of-the-art multimodal
systems, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion,
and others. MultiMon can also steer towards failures relevant to specific use
cases, such as self-driving cars. We see MultiMon as a step towards evaluation
that autonomously explores the long tail of potential system failures. Code for
MultiMon is available at this https URL
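
The notion of erroneous agreement described above can be sketched in a few lines. This is one plausible way to operationalise it, not MultiMon's actual code: the function names, the thresholds, the use of a second "reference" embedding to decide that two inputs should differ, and the hand-written toy vectors are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def erroneous_agreements(tested, reference, agree_at=0.95, differ_below=0.8):
    """Pairs the tested encoder treats as near-identical although the
    (hypothetical) reference signal says they mean different things."""
    flagged = []
    for a, b in combinations(sorted(tested), 2):
        agrees = cosine(tested[a], tested[b]) >= agree_at
        should_differ = cosine(reference[a], reference[b]) < differ_below
        if agrees and should_differ:
            flagged.append((a, b))
    return flagged


# Toy vectors standing in for real encoder outputs (hypothetical values).
tested = {
    "a shelf with a few books": [0.90, 0.10, 0.40],
    "a shelf with many books": [0.89, 0.12, 0.41],
    "a dog on a beach": [0.10, 0.95, 0.20],
}
reference = {
    "a shelf with a few books": [0.9, 0.1, 0.1],
    "a shelf with many books": [0.2, 0.9, 0.3],
    "a dog on a beach": [0.1, 0.2, 0.95],
}

flagged = erroneous_agreements(tested, reference)
print(flagged)
```

In a pipeline of the kind the abstract describes, the flagged pairs would then be passed to a language model that is asked to summarise the recurring pattern, such as ignored quantifiers.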

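By automatically identifying systematic failures, MultiMon finds 14 systematic failures of the CLIP text-encoder, a step towards autonomously exploring the long tail of potential system failures.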