In this paper, we introduce a Multimodal Large Language Model-based
Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and
proficiency in reasoning, comprehension, and response inherent in Large
Language Models (LLMs) to assist users in image generation and editing.
Diverging from existing approaches where Multimodal Large Language Models
(MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our
LLMGA provides a detailed language generation prompt for precise control over
SD. This not only augments LLM context understanding but also reduces noise in
generation prompts, yields images with more intricate and precise content, and
elevates the interpretability of the network. To this end, we curate a
comprehensive dataset comprising prompt refinement, similar image generation,
inpainting and outpainting, and visual question answering. Moreover, we
propose a two-stage training scheme. In the first stage, we train the MLLM to
grasp the properties of image generation and editing, enabling it to generate
detailed prompts. In the second stage, we optimize SD to align with the MLLM's
generation prompts. Additionally, we propose a reference-based restoration
network to alleviate texture, brightness, and contrast disparities between
generated and preserved regions during image editing. Extensive results show
that LLMGA has promising generative capabilities and can enable wider
applications in an interactive manner.

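This work introduces a Multimodal Large Language Model-based Generation Assistant (LLMGA), which draws on the knowledge and comprehension abilities inherent in Large Language Models (LLMs) to help users generate and edit images. It controls Stable Diffusion (SD) through precise generation prompts, yielding more detailed and accurate content and a more interpretable network. It also proposes a two-stage training scheme to optimize SD's generation results, and introduces a reference-based restoration network to reduce texture, brightness, and contrast disparities between generated and preserved regions during image editing. Extensive experimental results show that LLMGA has strong generative capabilities and can support a wider range of applications interactively.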

LLMGA: Multimodal Large Language Model based Generation Assistant

Denoising diffusion probabilistic models (DDPMs) are expressive generative
models that have been used to solve a variety of speech synthesis problems.
However, because of their high sampling costs, DDPMs are difficult to use in
real-time speech processing applications. In this paper, we introduce
DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving
high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising
diffusion generative adversarial networks (GANs), which adopt an
adversarially-trained expressive model to approximate the denoising
distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can
generate high-fidelity speech samples within only 4 denoising steps. We present
an active shallow diffusion mechanism to further speed up inference. A
two-stage training scheme is proposed, with a basic TTS acoustic model trained
at stage one providing valuable prior information for a DDPM trained at stage
two. Our experiments show that DiffGAN-TTS can achieve high synthesis
performance with only 1 denoising step.

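This paper introduces DiffGAN-TTS, a new text-to-speech model based on denoising diffusion generative adversarial networks. Multi-speaker TTS experiments show that it can generate high-fidelity speech samples in only 4 denoising steps. The paper also proposes a two-stage training scheme that achieves high-quality speech synthesis with only 1 denoising step.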