Text-to-image diffusion models have proven effective for solving many image
editing tasks. However, the seemingly straightforward task of seamlessly
relocating objects within a scene remains surprisingly challenging. Existing
methods addressing this problem often struggle to function reliably in
real-world scenarios because they lack spatial reasoning. In this work, we propose
a training-free method, dubbed DiffUHaul, that harnesses the spatial
understanding of a localized text-to-image model for the object dragging task.
Blindly manipulating the layout inputs of the localized model tends to degrade
editing performance due to the intrinsic entanglement of object representations
in the model. To address this, we first apply attention masking in each denoising
step to make the generation more disentangled across different objects and
adopt a self-attention sharing mechanism to preserve the high-level object
appearance. Furthermore, we propose a new diffusion anchoring technique: in the
early denoising steps, we interpolate the attention features between source and
target images to smoothly fuse new layouts with the original appearance; in the
later denoising steps, we pass the localized features from the source images to
the interpolated images to retain fine-grained object details. To adapt
DiffUHaul to real-image editing, we apply a DDPM self-attention bucketing that
can better reconstruct real images with the localized model. Finally, we
introduce an automated evaluation pipeline for this task and showcase the
efficacy of our method. Our results are reinforced through a user preference
study.
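
The two phases of the anchoring idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `blend_features` and `transfer_localized`, the flat list representation of attention features, and the index-based notion of object positions are all assumptions made for illustration. The abstract does not specify the interpolation weights or the step at which the phases switch, so those are left to the caller.

```python
from typing import List, Sequence


def blend_features(source: Sequence[float], target: Sequence[float],
                   alpha: float) -> List[float]:
    """Linearly interpolate two feature vectors.

    alpha = 0.0 returns the source features, alpha = 1.0 the target
    features. The weight schedule is an assumption, not taken from the paper.
    """
    if len(source) != len(target):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(source, target)]


def transfer_localized(source: Sequence[float], current: Sequence[float],
                       src_idx: Sequence[int],
                       dst_idx: Sequence[int]) -> List[float]:
    """Copy features at the object's source positions into its new positions.

    src_idx and dst_idx are hypothetical index lists marking where the
    object sits in the source image and where it should appear after
    dragging.
    """
    if len(src_idx) != len(dst_idx):
        raise ValueError("index lists must have the same length")
    result = list(current)
    for s, d in zip(src_idx, dst_idx):
        result[d] = source[s]
    return result


# Early step: fuse the new layout with the original appearance.
early = blend_features([0.0, 2.0], [2.0, 4.0], alpha=0.5)

# Later step: carry the object's source features to its new location.
late = transfer_localized([10.0, 20.0, 30.0], [0.0, 0.0, 0.0],
                          src_idx=[0], dst_idx=[2])
```

In a real model these operations would act on attention tensors inside the denoising network rather than on flat lists; the sketch only shows the arithmetic of each phase.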

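This work proposes DiffUHaul, a training-free method for the object dragging task that builds on the spatial understanding of a localized text-to-image model. It improves editing performance through attention masking, a self-attention sharing mechanism, and a diffusion anchoring technique, and uses DDPM self-attention bucketing to adapt to real-image editing.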

DiffUHaul: A Training-Free Method for Object Dragging in Images

A plethora of text-guided image editing methods have recently been developed
by leveraging the impressive capabilities of large-scale diffusion-based
generative models such as Imagen and Stable Diffusion. A standardized
evaluation protocol, however, does not exist to compare methods across
different types of fine-grained edits. To address this gap, we introduce
EditVal, a standardized benchmark for quantitatively evaluating text-guided
image editing methods. EditVal consists of a curated dataset of images, a set
of editable attributes for each image drawn from 13 possible edit types, and an
automated evaluation pipeline that uses pre-trained vision-language models to
assess the fidelity of generated images for each edit type. We use EditVal to
benchmark 8 cutting-edge diffusion-based editing methods including SINE, Imagic
and Instruct-Pix2Pix. We complement this with a large-scale human study where
we show that EditVal's automated evaluation pipeline is strongly correlated
with human preferences for the edit types we considered. From both the human
study and automated evaluation, we find that: (i) Instruct-Pix2Pix, Null-Text
and SINE are the top-performing methods averaged across different edit types,
however, only Instruct-Pix2Pix and Null-Text are able to preserve original
image properties; (ii) most of the editing methods fail at edits involving
spatial operations (e.g., changing the position of an object); (iii) there is
no 'winner' method which ranks the best individually across a range of
different edit types. We hope that our benchmark can pave the way to developing
more reliable text-guided image editing tools in the future. We will publicly
release EditVal, along with all associated code and human-study templates, to
support these research directions.

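This work introduces EditVal, a standardized benchmark for quantitatively evaluating text-guided image editing methods, and uses it to benchmark 8 cutting-edge diffusion-based editing methods. It finds that Instruct-Pix2Pix and Null-Text perform best while preserving original image properties, that most editing methods fail at spatial operations, and that no single method ranks best across all edit types. The authors hope the benchmark will pave the way to more reliable text-guided image editing tools.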

EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods

Spatial understanding is a fundamental aspect of computer vision and integral
to human-level reasoning about images, making it an important component of
grounded language understanding. While recent large-scale text-to-image
synthesis (T2I) models have shown unprecedented improvements in photorealism,
it is unclear whether they have reliable spatial understanding capabilities. We
investigate the ability of T2I models to generate correct spatial relationships
among objects and present VISOR, an evaluation metric that captures how
accurately the spatial relationship described in text is generated in the
image. To benchmark existing models, we introduce a large-scale challenge
dataset SR2D that contains sentences describing two objects and the spatial
relationship between them. We construct and harness an automated evaluation
pipeline that employs computer vision to recognize objects and their spatial
relationships, and we employ it in a large-scale evaluation of T2I models. Our
experiments reveal a surprising finding that, although recent state-of-the-art
T2I models exhibit high image quality, they are severely limited in their
ability to generate multiple objects or the specified spatial relations such as
left/right/above/below. Our analyses demonstrate several biases and artifacts
of T2I models such as the difficulty with generating multiple objects, a bias
towards generating the first object mentioned, spatially inconsistent outputs
for equivalent relationships, and a correlation between object co-occurrence
and spatial understanding capabilities. We conduct a human study that shows the
alignment between VISOR and human judgment about spatial understanding. We
offer the SR2D dataset and the VISOR metric to the community in support of T2I
spatial reasoning research.
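
As a rough illustration of how an automated pipeline might check a spatial relationship once objects have been detected, the sketch below compares bounding-box centroids. It is a simplified assumption, not the VISOR definition: the box format, the function names `centroid`, `relation_holds` and `spatial_accuracy`, and the centroid-comparison rule are all hypothetical, and the object detector itself is left out.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in image coordinates,
# where y grows downward.
Box = Tuple[float, float, float, float]
Sample = Tuple[Dict[str, Box], str, str, str]  # detections, A, relation, B


def centroid(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def relation_holds(box_a: Box, box_b: Box, relation: str) -> bool:
    """Check whether object A stands in `relation` to object B."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left":
        return ax < bx
    if relation == "right":
        return ax > bx
    if relation == "above":
        return ay < by  # smaller y is higher in the image
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")


def spatial_accuracy(samples: List[Sample]) -> float:
    """Fraction of images where both objects appear and the relation holds."""
    if not samples:
        return 0.0
    correct = 0
    for detections, name_a, relation, name_b in samples:
        if name_a in detections and name_b in detections:
            if relation_holds(detections[name_a], detections[name_b], relation):
                correct += 1
    return correct / len(samples)


samples: List[Sample] = [
    # Both objects detected, dog is left of cat: counts as correct.
    ({"dog": (0, 0, 10, 10), "cat": (20, 0, 30, 10)}, "dog", "left", "cat"),
    # Second object missing from the image: counts as incorrect.
    ({"dog": (0, 0, 10, 10)}, "dog", "above", "cat"),
]
score = spatial_accuracy(samples)
```

Counting an image as incorrect when either object is missing reflects the observation above that models often fail to generate both objects at all.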

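This paper studies the spatial understanding capabilities of large-scale text-to-image synthesis (T2I) models. It proposes the VISOR evaluation metric, introduces the large-scale SR2D dataset and an automated evaluation pipeline, and runs a large-scale evaluation of T2I models, finding severe limitations and biases in generating multiple objects and spatial relationships. The dataset and metric are offered to support T2I spatial reasoning research.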