This paper presents a novel approach to image-to-image generation that leverages the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). In the proposed framework, LLaVA analyzes an input image and generates a textual description of it, referred to hereafter as a LLaVA-generated prompt. This prompt is fed, together with the original image, into the image-to-image generation pipeline. The enriched representation guides the generation process toward outputs that more closely resemble the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity: compared with traditional methods, we observe a significant improvement in visual coherence between the generated and input images. Future work will explore fine-tuning LLaVA-generated prompts for greater control over the creative process. By providing more specific details within the prompts, we aim to balance faithfulness to the original image against artistic expression in the generated outputs.
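The data flow described above (caption the input image, then pass both the caption and the image to the generator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `captioner` stands in for a call to LLaVA and `generator` for any image-to-image model, and the names `Img2ImgRequest`, `build_request`, `generate_similar_image`, and the `strength` parameter are hypothetical. The optional `negative_prompt` field reflects the negative prompts mentioned in the title.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Placeholder for whatever image type the chosen backends use (assumption).
Image = Any


@dataclass
class Img2ImgRequest:
    """Everything the image-to-image model receives for one generation."""
    image: Image                # the original input image
    prompt: str                 # textual description produced by the captioner
    negative_prompt: str = ""   # optional text describing what to avoid
    strength: float = 0.5       # hypothetical knob: 0 keeps the input, 1 ignores it


def build_request(
    image: Image,
    captioner: Callable[[Image], str],
    negative_prompt: str = "",
    strength: float = 0.5,
) -> Img2ImgRequest:
    """Caption the image and bundle the caption with the image itself."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    prompt = captioner(image).strip()
    if not prompt:
        raise ValueError("captioner returned an empty description")
    return Img2ImgRequest(image, prompt, negative_prompt, strength)


def generate_similar_image(
    image: Image,
    captioner: Callable[[Image], str],
    generator: Callable[[Img2ImgRequest], Image],
    negative_prompt: str = "",
    strength: float = 0.5,
) -> Image:
    """Run the two-stage pipeline: describe the image, then regenerate it."""
    request = build_request(image, captioner, negative_prompt, strength)
    return generator(request)


if __name__ == "__main__":
    # Stub backends so the sketch runs without any model weights.
    def fake_captioner(img: Image) -> str:
        return "a red bicycle leaning against a brick wall"

    def fake_generator(req: Img2ImgRequest) -> Image:
        return f"<image conditioned on {req.prompt!r}>"

    print(generate_similar_image("input.png", fake_captioner, fake_generator))
```

Keeping the captioner and generator as plain callables makes the two stages independently replaceable, so a real LLaVA wrapper or diffusion backend could be substituted for the stubs without changing the pipeline logic.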


Enhancing Image Generation with LLaVA Prompts and Negative Prompts