While text-to-image models have achieved impressive capabilities in image generation and editing, applying them across different modalities often requires training a separate model for each. Inspired by existing methods for single-image editing with self-attention injection and video editing with shared attention, we propose a novel unified editing framework that combines the strengths of both approaches using only a basic 2D text-to-image (T2I) diffusion model. Specifically, we design a sampling method that edits consecutive images while maintaining semantic consistency by sharing self-attention features during both the reference and consecutive image sampling processes. Experimental results confirm that our method enables editing across diverse modalities, including 3D scenes, videos, and panorama images.
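The idea of sharing self-attention features between a reference image and a consecutive image can be illustrated with a small NumPy sketch. It is a generic illustration, not the sampling procedure of the paper: the function name `shared_self_attention`, the choice to concatenate reference keys and values ahead of the target's own, and all shapes are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_self_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Hypothetical helper: target queries attend to reference and target tokens.

    q_tgt, k_tgt, v_tgt: (n_tgt, d) features of the image being edited.
    k_ref, v_ref:        (n_ref, d) features saved from the reference image.
    """
    # Assumption: reference keys/values are simply prepended to the target's own.
    k = np.concatenate([k_ref, k_tgt], axis=0)
    v = np.concatenate([v_ref, v_tgt], axis=0)
    scores = q_tgt @ k.T / np.sqrt(q_tgt.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy features standing in for one self-attention layer of a diffusion U-Net.
rng = np.random.default_rng(0)
n_ref, n_tgt, d = 4, 3, 8
q_tgt, k_tgt, v_tgt = (rng.standard_normal((n_tgt, d)) for _ in range(3))
k_ref, v_ref = (rng.standard_normal((n_ref, d)) for _ in range(2))

out, weights = shared_self_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref)
```

Because every target token draws part of its output from the reference values, appearance set by the reference edit can carry over to the consecutive image without any additional training.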

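Using only a basic 2D text-to-image diffusion model, we propose a novel unified editing framework that combines the strengths of single-image editing with self-attention injection and video editing with shared attention. By sharing self-attention features during the reference and consecutive image sampling processes, we design a sampling method that edits consecutive images while maintaining semantic consistency. Experimental results show that our method can edit across multiple modalities, including 3D scenes, videos, and panorama images.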

Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection