In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting $\&$ outpainting, and visual question answering. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during image editing. Extensive results show that LLMGA has promising generative capabilities and can enable wider applications in an interactive manner.

该研究介绍了一种基于多模态大型语言模型的生成助手（LLMGA），利用大型语言模型（LLM）中内在的知识和理解能力，帮助用户进行图像生成和编辑，通过精确控制生成提示实现对稳定扩散（SD）的控制，以提供更精细、准确的内容和更直观的网络解释性，同时还提出了一个两阶段的训练方案来优化SD的生成结果，并引入基于参考的恢复网络来减少图像编辑过程中生成区域与保留区域之间的纹理、亮度和对比度差异。广泛的实验结果表明，LLMGA具有很好的生成能力，并能以交互方式在更广泛的应用中发挥作用。

LLMGA: 基于多模态大型语言模型的生成助手