Advances in text-based image generation and editing have revolutionized
content creation, enabling users to create impressive content from imaginative
text prompts. However, existing methods are not designed to work well with the
oversimplified prompts that arise when users begin editing with only a vague or
abstract purpose in mind. Such scenarios demand elaborate ideation effort from
users to bridge the gap between a vague starting point and the detailed
creative ideas needed to depict the desired result. In this paper, we introduce
the task of Image
Editing Recommendation (IER). This task aims to automatically generate diverse
creative editing instructions from an input image and a simple prompt
representing the users' under-specified editing purpose. To this end, we
introduce Creativity-Vision Language Assistant (Creativity-VLA), a multimodal
framework designed specifically for edit-instruction generation. We train
Creativity-VLA on our edit-instruction dataset specifically curated for IER. We
further enhance our model with a novel 'token-for-localization' mechanism,
enabling it to support both global and local editing operations. Our
experimental results demonstrate the effectiveness of Creativity-VLA in suggesting
instructions that not only contain engaging creative elements but also maintain
high relevance to both the input image and the user's initial hint.

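This paper introduces the task of Image Editing Recommendation. By training Creativity-Vision Language Assistant and providing an edit-instruction dataset, it generates diverse creative editing instructions from an input image and a simple prompt. With a novel 'token-for-localization' mechanism, our model supports both global and local editing operations, and it proves effective at suggesting instructions that contain engaging creative elements while remaining highly relevant to the input image and the user's initial hint.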

Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations

Diffusion models have achieved remarkable results in generating high-quality,
diverse, and creative images. However, when it comes to text-based image
generation, they often fail to capture the intended meaning presented in the
text. For instance, a specified object may not be generated, an unnecessary
object may be generated, and an adjective may alter objects it was not intended
to modify. Moreover, we found that relationships indicating possession between
objects are often overlooked. While users' intentions in text are diverse,
existing methods tend to specialize in only some aspects of these. In this
paper, we propose Predicated Diffusion, a unified framework to express users'
intentions. We consider that the root of the above issues lies in the text
encoder, which often focuses only on individual words and neglects the logical
relationships between them. The proposed method does not solely rely on the
text encoder, but instead, represents the intended meaning in the text as
propositions using predicate logic and treats the pixels in the attention maps
as fuzzy predicates. This yields a differentiable loss function whose
minimization makes the image fulfill the proposition. Compared with several
existing methods, Predicated Diffusion generates images that are more faithful
to various text prompts, as verified by human evaluators and pretrained
image-text models.
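
The idea of treating attention-map pixels as fuzzy predicates can be illustrated with a minimal sketch. It assumes product fuzzy logic (negation as 1 - a, conjunction as a product), which the abstract does not specify, and it uses hypothetical function names and plain Python lists in place of real attention maps. In practice such losses would be computed in an automatic-differentiation framework so that gradients can guide the image; this sketch only shows the arithmetic.

```python
import math

EPS = 1e-6  # assumption: small floor that keeps log() finite


def existence_loss(attn):
    """Negative log truth value of 'there exists x such that P(x)'.

    attn holds attention values in [0, 1], one per pixel, for the word P.
    Truth value under product fuzzy logic: 1 - prod(1 - a).
    """
    none_present = math.prod(1.0 - a for a in attn)
    return -math.log(max(1.0 - none_present, EPS))


def implication_loss(attn_p, attn_q):
    """Negative log truth value of 'for all x, P(x) implies Q(x)'.

    Per-pixel truth value: 1 - p * (1 - q); the pixels are combined by a product,
    which becomes a sum of logs.
    """
    return -sum(
        math.log(max(1.0 - p * (1.0 - q), EPS))
        for p, q in zip(attn_p, attn_q)
    )


# Toy flattened attention maps over four pixels (made-up values).
attn_car = [0.05, 0.90, 0.85, 0.10]
attn_green = [0.05, 0.80, 0.75, 0.05]

# "A green car": a car should exist, and green pixels should be car pixels.
total_loss = existence_loss(attn_car) + implication_loss(attn_green, attn_car)
```

The existence term is small when at least one pixel attends strongly to the object, and the implication term is small when the adjective's attention stays inside the attention of the noun it is meant to modify.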

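In this paper, we propose Predicated Diffusion, a unified framework for expressing users' intentions. It represents the intended meaning of the text as propositions in predicate logic and treats the pixels in the attention maps as fuzzy predicates, yielding a differentiable loss function that makes the image fulfill the propositions. Compared with several existing methods, we show that Predicated Diffusion generates images that are more faithful to various text prompts, as verified by human evaluators and pretrained image-text models.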

Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models

The application of zero-shot learning in computer vision has been
revolutionized by the use of image-text matching models. The most notable
example, CLIP, has been widely used for both zero-shot classification and
guiding generative models with a text prompt. However, the zero-shot use of
CLIP is unstable with respect to the phrasing of the input text, making it
necessary to carefully engineer the prompts used. We find that this instability
stems from a selective similarity score, which is based only on a subset of the
semantically meaningful input tokens. To mitigate it, we present a novel
explainability-based approach, which adds a loss term to ensure that CLIP
focuses on all relevant semantic parts of the input, in addition to employing
the CLIP similarity loss used in previous works. When applied to one-shot
classification through prompt engineering, our method yields an improvement in
the recognition rate, without additional training or fine-tuning. Additionally,
we show that CLIP guidance of generative models using our method significantly
improves the generated images. Finally, we demonstrate a novel use of CLIP
guidance for text-based image generation with spatial conditioning on object
location, by requiring the image explainability heatmap for each object to be
confined to a pre-determined bounding box.
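
The two loss ideas described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulation: the function names are hypothetical, token relevance scores are assumed to be normalized to [0, 1], the heatmap is a plain nested list, and the box is given as (top, left, bottom, right) with exclusive bottom and right edges.

```python
def token_coverage_penalty(relevance, semantic_idx):
    """Mean shortfall in relevance over the semantically meaningful tokens.

    relevance: per-token explainability scores, assumed to lie in [0, 1].
    semantic_idx: indices of the tokens that carry meaning in the prompt.
    The penalty is high when any such token is largely ignored.
    """
    return sum(1.0 - relevance[i] for i in semantic_idx) / len(semantic_idx)


def outside_box_fraction(heatmap, box):
    """Fraction of heatmap mass that falls outside the bounding box.

    heatmap: 2D list of non-negative relevance values for one object.
    box: (top, left, bottom, right), bottom and right exclusive.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0  # nothing to confine
    inside = sum(
        heatmap[r][c]
        for r in range(top, bottom)
        for c in range(left, right)
    )
    return 1.0 - inside / total


# Toy example: token 1 is nearly ignored, so the coverage penalty is large.
penalty = token_coverage_penalty([0.9, 0.1, 0.8], semantic_idx=[0, 1, 2])

# Toy 2x2 heatmap whose mass is split between two corners.
leak = outside_box_fraction([[1.0, 0.0], [0.0, 1.0]], box=(1, 1, 2, 2))
```

Adding the first term to a CLIP similarity loss would push the score to depend on every meaningful token, while minimizing the second for each object would keep its heatmap inside the requested location.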

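This study proposes an explainability-based method to address the instability of CLIP with respect to the input text in zero-shot learning and image generation. The method adds a loss term that ensures CLIP attends to all relevant semantic parts of the input, and it improves both the recognition rate and the quality of generated images. The study also demonstrates novel applications of CLIP in one-shot classification, guidance of generative models, and text-based image generation with spatial conditioning.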