The creation of high-quality human-labeled image-caption datasets presents a
significant bottleneck in the development of Visual-Language Models (VLMs). We
propose a novel approach that leverages the strengths of Large Language Models
(LLMs) and image generation models to create synthetic image-text pairs for
efficient and effective VLM training. Our method pretrains a text-to-image
model to synthesize image embeddings from captions generated by an LLM.
These synthetic pairs are then used to train a VLM.
Extensive experiments demonstrate that the VLM trained with synthetic data
exhibits comparable performance on image captioning, while requiring a fraction
of the data used by models trained solely on human-annotated data. In
particular, we outperform the baseline by 17% through augmentation with a
synthetic dataset. Furthermore, we show that synthesizing in the image
embedding space is 25% faster than in the pixel space. This research introduces
a promising technique for generating large-scale, customizable image datasets,
leading to enhanced VLM performance and wider applicability across various
domains, all with improved data efficiency and resource utilization.
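
The data flow described above (LLM-generated captions, then image embeddings synthesized from those captions, then caption-embedding pairs for VLM training) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `generate_captions` stands in for the LLM with fixed templates, `caption_to_embedding` stands in for the pretrained text-to-image model with a hash-based pseudo-embedding, and all function names, the templates, and the embedding size are assumptions made for this example.

```python
import hashlib
from typing import List, Tuple


def generate_captions(class_names: List[str], templates: List[str]) -> List[str]:
    # Hypothetical stand-in for the LLM: fill fixed templates with class names.
    return [t.format(name) for name in class_names for t in templates]


def caption_to_embedding(caption: str, dim: int = 8) -> List[float]:
    # Hypothetical stand-in for the text-to-image model. It maps a caption to a
    # deterministic vector in [0, 1]^dim, so no pixels are ever rendered.
    # SHA-256 yields 32 bytes, so dim must be at most 32 in this toy version.
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def build_synthetic_pairs(
    class_names: List[str], templates: List[str], dim: int = 8
) -> List[Tuple[List[float], str]]:
    # Each pair is (synthetic image embedding, caption), the unit a VLM
    # would be trained on in place of a (real image, human caption) pair.
    captions = generate_captions(class_names, templates)
    return [(caption_to_embedding(c, dim), c) for c in captions]


pairs = build_synthetic_pairs(
    ["cat", "dog"], ["a photo of a {}", "a drawing of a {}"]
)
```

The point of the sketch is only the shape of the data: the training set holds embeddings rather than rendered images, which is where the abstract's reported speed-up over pixel-space synthesis would come from.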

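We propose a new method that leverages the strengths of large language models (LLMs) and image generation models to create synthetic image-text pairs for efficient training of visual-language models (VLMs). By pretraining a text-to-image model to synthesize image embeddings from LLM-generated captions, our method trains a VLM on synthetic data that achieves comparable performance on image captioning while requiring only a fraction of the human-annotated data. This research introduces a promising technique for generating large-scale, customizable image datasets, which improves VLM performance, broadens applicability across domains, and improves data efficiency and resource utilization.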

Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Although image captioning models have made significant advancements in recent
years, the majority of them heavily depend on high-quality datasets containing
paired images and texts, which are costly to acquire. Previous work leverages
CLIP's cross-modal association ability for image captioning, relying solely
on textual information under unsupervised settings. However, not only does a
modality gap exist between CLIP text and image features, but a discrepancy also
arises between training and inference due to the unavailability of real-world
images, which hinders the cross-modal alignment in text-only captioning. This
paper proposes a novel method to address these issues by incorporating
synthetic image-text pairs. A pre-trained text-to-image model is deployed to
obtain images that correspond to textual data, and the pseudo features of
generated images are optimized toward the real ones in the CLIP embedding
space. Furthermore, textual information is gathered to represent image
features, yielding image features with diverse semantics and a narrower
modality gap. To unify training and inference, synthetic image features
serve as the training prefix for the language decoder, while real images
are used for inference. Additionally, salient objects in images are detected
to assist the learning of modality alignment. Experimental results
demonstrate that our method achieves state-of-the-art performance on
benchmark datasets.
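
The abstract does not say how textual information is "gathered to represent image features". One common form of this idea, shown here purely as an assumption, is to re-express an image feature as a softmax-weighted mix of text features, weighted by cosine similarity. The function names, the temperature value, and the two-dimensional toy vectors below are all hypothetical; a real system would use CLIP embeddings.

```python
import math
from typing import List, Tuple


def normalize(v: List[float]) -> List[float]:
    # Scale a vector to unit L2 norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def project_onto_text(
    image_feat: List[float],
    text_feats: List[List[float]],
    temperature: float = 0.05,
) -> Tuple[List[float], List[float]]:
    # Assumed scheme: cosine similarity to every text feature, softmax over the
    # similarities, then a weighted sum of the (normalized) text features.
    q = normalize(image_feat)
    keys = [normalize(t) for t in text_feats]
    sims = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]
    mixed = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(len(q))]
    return normalize(mixed), weights


# Toy 2-D example: the image feature is closest to the first text feature.
projected, weights = project_onto_text([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```

Because the result is built only from text features, it lies in the span of the text embeddings, which is one way to read the claim that this narrows the modality gap. A lower temperature pushes the result toward the single nearest text feature.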

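This work proposes a new method that addresses the cross-modal alignment problem in image captioning by incorporating synthetic image-text pairs. A pre-trained text-to-image model generates the images, the pseudo features of the synthetic images are optimized toward real image features in the CLIP embedding space, and salient objects in the images are used to strengthen the learning of modality alignment. Experiments show that the method achieves state-of-the-art performance on benchmark datasets.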

Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning

Text-only Image Captioning (TIC) aims to build a model that accurately
describes images while being trained solely on text. Recently,
diffusion models have demonstrated remarkable capabilities in generating
high-quality images that are semantically coherent with given texts. This
presents an opportunity to generate synthetic training images for TIC. However,
we find that images generated from simple descriptions typically show a single
perspective with one or few contexts, which does not match the complexity of
real-world scenes in the image domain. In this paper, we propose a novel
framework that addresses this
issue by introducing multi-context data generation. Starting with an initial
text corpus, our framework employs a large language model to select multiple
sentences that describe the same scene from various perspectives. These
sentences are then summarized into a single sentence with multiple contexts. We
generate simple images using the straightforward sentences and complex images
using the summarized sentences through diffusion models. Finally, we train the
model exclusively using the synthetic image-text pairs obtained from this
process. Experimental results demonstrate that our proposed framework
effectively tackles the central challenge we have identified, achieving
state-of-the-art performance on popular datasets such as MSCOCO, Flickr30k, and
SS1M.
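
The multi-context generation steps above (select sentences about the same scene, summarize them into one sentence, then produce simple and complex prompts for the diffusion model) can be sketched as follows. In the paper a large language model does both the selection and the summarization. The word-overlap selection and the string-joining "summary" here are crude stand-ins chosen so the example runs without any model, and every name, the stopword list, and the threshold are assumptions.

```python
from typing import List, Set, Tuple

STOPWORDS = {"a", "an", "the", "on", "in", "of", "is", "and", "with", "at"}


def content_words(sentence: str) -> Set[str]:
    # Lowercased words with trailing punctuation and stopwords removed.
    return {w.strip(".,").lower() for w in sentence.split()} - STOPWORDS


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_related(
    anchor: str, corpus: List[str], k: int = 2, threshold: float = 0.2
) -> List[str]:
    # Stand-in for the LLM selection step: keep the k sentences whose content
    # words overlap most with the anchor sentence.
    anchor_words = content_words(anchor)
    scored = [(jaccard(anchor_words, content_words(s)), s) for s in corpus if s != anchor]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


def summarize(sentences: List[str]) -> str:
    # Stand-in for the LLM summarization step: join the sentences into one.
    return "; ".join(s.rstrip(".") for s in sentences) + "."


def build_prompts(corpus: List[str]) -> List[Tuple[str, str]]:
    # Emit a "simple" prompt per sentence, plus a "complex" multi-context
    # prompt whenever related sentences are found.
    prompts = []
    for sentence in corpus:
        prompts.append(("simple", sentence))
        related = select_related(sentence, corpus)
        if related:
            prompts.append(("complex", summarize([sentence] + related)))
    return prompts


corpus = [
    "A dog runs on the beach.",
    "A dog chases a ball on the beach.",
    "A man reads a newspaper.",
]
prompts = build_prompts(corpus)
```

Each prompt would then be passed to a diffusion model, and the resulting image paired with its prompt text for training. That step is omitted here because it depends on the specific model used.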

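This paper proposes a new multi-context data generation framework to improve the training data for text-only image captioning. The framework uses diffusion models to generate both simple and complex images, and achieves state-of-the-art performance on datasets such as MSCOCO, Flickr30k, and SS1M.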