Recently, large-scale diffusion models such as Stable Diffusion and DALL-E 2
have shown remarkable results in image synthesis. Meanwhile,
large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP)
perform well on various downstream tasks by learning to align vision and language
embeddings. In this paper, we explore the possibility of jointly modeling
generation and discrimination. Specifically, we propose DiffDis to unify
cross-modal generative and discriminative pre-training in a single framework
under the diffusion process. DiffDis first formulates the image-text
discriminative problem as a generative diffusion process of the text embedding
from the text encoder conditioned on the image. Then, we propose a novel
dual-stream network architecture, which fuses the noisy text embedding with the
knowledge of latent images from different scales for image-text discriminative
learning. Moreover, the generative and discriminative tasks can efficiently
share the image-branch network structure in the multi-modality model.
Benefiting from diffusion-based unified training, DiffDis achieves both better
generation ability and cross-modal semantic alignment in one architecture.
Experimental results show that DiffDis outperforms single-task models on both
image generation and image-text discriminative tasks, e.g., a 1.65%
improvement in average zero-shot classification accuracy over 12 datasets
and a 2.42 improvement in FID for zero-shot image synthesis.

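In this paper, we propose DiffDis, which unifies cross-modal generative and discriminative pre-training in a single framework by extending the diffusion process. DiffDis introduces a novel dual-stream network architecture that fuses the noisy text embedding with knowledge of latent images at different scales to solve the image-text discriminative task. Through diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. Experimental results show that DiffDis outperforms single-task models on both image generation and image-text discriminative tasks, e.g., a 1.65% improvement in average zero-shot classification accuracy over 12 datasets and a 2.42-point improvement in FID for zero-shot image synthesis.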
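
The formulation described above, in which the text embedding is produced by a diffusion process conditioned on the image, can be illustrated with a minimal sketch of the forward (noising) half of such a process. The sketch assumes a standard DDPM-style Gaussian forward process with a linear beta schedule; the abstract does not state the schedule, and the function names, step count, and toy embedding below are hypothetical. The dual-stream denoising network that would consume the noisy embedding together with image latents is not shown.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t.

    Assumes a linear beta schedule; this is an illustrative choice,
    not one taken from the abstract.
    """
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_text_embedding(x0, t, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar) * x_0, (1 - a_bar) * I).

    Returns the noisy embedding and the Gaussian noise that was added.
    """
    a_bar = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps


rng = random.Random(0)
# Toy stand-in for the output of a text encoder (hypothetical values).
text_embedding = [0.5, -0.5, 0.5, -0.5]
noisy, eps = noise_text_embedding(text_embedding, t=500, rng=rng)
```

Under these assumptions, a denoiser conditioned on image features would be trained to recover either the added noise or the clean text embedding from `noisy`; which target DiffDis uses is not stated in the abstract.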

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

While many BERT-based cross-modal pre-trained models produce excellent
results on downstream understanding tasks like image-text retrieval and VQA,
they cannot be applied to generation tasks directly. In this paper, we propose
XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning
that is designed to pre-train text-to-image caption generators through three
novel generation tasks, including Image-conditioned Masked Language Modeling
(IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned
Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be
fine-tuned without any task-specific architecture modifications to create
state-of-the-art models for image captioning. Experiments show that XGPT
obtains new state-of-the-art results on the benchmark datasets, including COCO
Captions and Flickr30k Captions. We also use XGPT to generate new image
captions as data augmentation for the image retrieval task and achieve
significant improvement on all recall metrics.

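This paper proposes XGPT, a new cross-modal generative pre-training method for image captioning that can be fine-tuned without task-specific architecture modifications. Experiments show that it achieves new state-of-the-art results on benchmark datasets and yields significant improvements on the image retrieval task when its generated captions are used as data augmentation.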
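
To make the idea behind Image-conditioned Masked Language Modeling more concrete, the following is a minimal sketch of how masked caption inputs and their prediction targets might be built. It assumes independent per-token masking with a fixed probability, which is a generic masked-language-modeling choice and not necessarily the masking strategy XGPT uses; the `mask_caption` helper, the `[MASK]` symbol, and the example caption are hypothetical. The image features that would condition the model are omitted.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def mask_caption(tokens, mask_prob, rng):
    """Mask each token independently with probability mask_prob.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model should predict.
    """
    inputs = []
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets


rng = random.Random(0)
caption = "a dog catches a frisbee in the park".split()
masked, targets = mask_caption(caption, mask_prob=0.3, rng=rng)
```

In an image-conditioned setting, the model would receive `masked` together with features of the corresponding image and be trained to predict the tokens stored in `targets`.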