This paper introduces VLAP, a novel approach that bridges pretrained vision
models and large language models (LLMs) to make frozen LLMs understand the
visual world. VLAP transforms the embedding space of pretrained vision models
into the LLMs' word embedding space using a single linear layer for efficient
and general-purpose visual and language understanding. Specifically, we harness
well-established word embeddings to bridge two modality embedding spaces. The
visual and text representations are simultaneously assigned to a set of word
embeddings within pretrained LLMs by formulating the assigning procedure as an
optimal transport problem. We predict the assignment of one modality from the
representation of the other modality, enforcing consistent assignments for
paired multimodal data. This allows vision and language representations to
contain the same information, grounding the frozen LLMs' word embedding space
in visual data. Moreover, the robust semantic taxonomy of LLMs can be preserved
with visual data, since LLMs interpret and reason about linguistic information
from correlations between word embeddings. Experimental results show that VLAP
achieves substantial improvements over the previous linear transformation-based
approaches across a range of vision-language tasks, including image captioning,
visual question answering, and cross-modal retrieval. We also demonstrate that
the learned visual representations hold the semantic taxonomy of LLMs, making
visual semantic arithmetic possible.
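
The assignment-prediction idea described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a Sinkhorn-style solver for the entropy-regularized optimal transport step and a swapped cross-entropy objective, and every name, shape, and hyperparameter (`sinkhorn`, `assignment_prediction_loss`, `proj`, `eps`, `temp`) is hypothetical. Text features are assumed to live in the word embedding space already.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def sinkhorn(scores, eps=0.05, n_iters=50):
    """Soft-assign n samples (rows) to k word embeddings (columns).

    Alternating column/row normalization approximates an entropy-regularized
    optimal transport plan that spreads samples evenly over the words.
    """
    q = np.exp((scores - scores.max()) / eps)  # shift for numerical stability
    n, k = q.shape
    for _ in range(n_iters):
        q = q / (q.sum(axis=0, keepdims=True) * k)  # each column gets mass 1/k
        q = q / (q.sum(axis=1, keepdims=True) * n)  # each row gets mass 1/n
    return q * n  # rescale so every row sums to 1

def assignment_prediction_loss(img_feats, txt_feats, proj, word_emb, temp=0.1):
    """Predict each modality's assignment from the other modality's scores."""
    words = l2_normalize(word_emb)
    # The single linear layer (proj) maps visual features into the word space.
    s_img = l2_normalize(img_feats @ proj) @ words.T
    s_txt = l2_normalize(txt_feats) @ words.T
    # Assignments act as fixed targets (no gradient would flow through them).
    q_img, q_txt = sinkhorn(s_img), sinkhorn(s_txt)
    loss_img = -(q_txt * log_softmax(s_img / temp)).sum(axis=1).mean()
    loss_txt = -(q_img * log_softmax(s_txt / temp)).sum(axis=1).mean()
    return 0.5 * (loss_img + loss_txt)

# Toy paired batch: 8 image/text pairs, 20 word embeddings of dimension 16.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 32))
txt_feats = rng.normal(size=(8, 16))
word_emb = rng.normal(size=(20, 16))
proj = rng.normal(size=(32, 16)) * 0.1
loss = assignment_prediction_loss(img_feats, txt_feats, proj, word_emb)
```

In an actual training loop, only `proj` would be updated while the word embeddings and the LLM stay frozen, which matches the single-linear-layer setup the abstract describes.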

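This paper introduces VLAP, a bridge between pretrained vision models and large language models for visual understanding. Its novel approach maps the embedding space of pretrained vision models into the LLMs' word embedding space, enabling efficient and general-purpose vision and language understanding.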

Bridging Vision and Language Spaces with Assignment Prediction

Generative Adversarial Networks (GANs) have been widely used to recover vivid
textures in image super-resolution (SR) tasks. In particular, one discriminator
is utilized to enable the SR network to learn the distribution of real-world
high-quality images in an adversarial training manner. However, this
distribution learning is overly coarse-grained, which makes it susceptible to
virtual textures and leads to counter-intuitive generation results. To mitigate
this, we propose a simple and effective Semantic-aware Discriminator (denoted as SeD),
which encourages the SR network to learn the fine-grained distributions by
introducing the semantics of images as a condition. Concretely, we extract the
semantics of images with a well-trained semantic extractor. Under different
semantics, the discriminator is able to distinguish real from fake images
individually and adaptively, which guides the SR network to learn more
fine-grained semantic-aware textures. To obtain accurate and abundant
semantics, we take full advantage of recently popular pretrained vision models
(PVMs) trained on extensive datasets, and then incorporate their semantic
features into the discriminator through a well-designed spatial cross-attention
module. In this way, our proposed semantic-aware discriminator empowers the SR
network to produce more photo-realistic and pleasing images. Extensive
experiments on two typical tasks, i.e., SR and Real SR, demonstrate the
effectiveness of our proposed method.
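
To illustrate how semantic features might be fused into a discriminator through spatial cross-attention, here is a minimal single-head NumPy sketch. It is not the SeD module itself: the abstract does not say which feature map acts as the query, so using the discriminator features as queries over the PVM semantic features is an assumption here, and all names, shapes, and projection matrices (`spatial_cross_attention`, `w_q`, `w_k`, `w_v`) are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def spatial_cross_attention(query_map, context_map, w_q, w_k, w_v):
    """Single-head cross-attention between two spatial feature maps.

    query_map:   (H, W, Cq)   e.g. discriminator features (assumed role)
    context_map: (Hc, Wc, Cc) e.g. semantic features from a PVM (assumed role)
    Returns the fused map of shape (H, W, d) and the attention weights.
    """
    h, w, _ = query_map.shape
    # Flatten spatial positions into token sequences.
    q = query_map.reshape(h * w, -1) @ w_q                     # (H*W, d)
    ctx = context_map.reshape(-1, context_map.shape[-1])
    k = ctx @ w_k                                              # (Hc*Wc, d)
    v = ctx @ w_v                                              # (Hc*Wc, d)
    # Each query position attends over every context position.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))             # (H*W, Hc*Wc)
    fused = (attn @ v).reshape(h, w, -1)
    return fused, attn

# Toy shapes: 4x4 discriminator map (8 channels), 2x2 semantic map (12 channels).
rng = np.random.default_rng(0)
disc_feats = rng.normal(size=(4, 4, 8))
sem_feats = rng.normal(size=(2, 2, 12))
w_q = rng.normal(size=(8, 16))
w_k = rng.normal(size=(12, 16))
w_v = rng.normal(size=(12, 16))
fused, attn = spatial_cross_attention(disc_feats, sem_feats, w_q, w_k, w_v)
```

The fused map keeps the spatial layout of the query map, so a discriminator could in principle continue with its usual convolutional layers on the semantically conditioned features.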

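We propose a simple yet effective semantic-aware discriminator. By introducing image semantics as a condition, the discriminator can distinguish real from fake images individually and adaptively, guiding the super-resolution network to learn fine-grained, semantic-aware textures and thus generate more photo-realistic and pleasing images.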

SeD: Semantic-Aware Discriminator for Image Super-Resolution

The advent of large-scale training has produced a cornucopia of powerful
visual recognition models. However, generative models, such as GANs, have
traditionally been trained from scratch in an unsupervised manner. Can the
collective "knowledge" from a large bank of pretrained vision models be
leveraged to improve GAN training? If so, with so many models to choose from,
which one(s) should be selected, and in what manner are they most effective? We
find that pretrained computer vision models can significantly improve
performance when used in an ensemble of discriminators. Notably, the particular
subset of selected models greatly affects performance. We propose an effective
selection mechanism, by probing the linear separability between real and fake
samples in pretrained model embeddings, choosing the most accurate model, and
progressively adding it to the discriminator ensemble. Interestingly, our
method can improve GAN training in both limited data and large-scale settings.
Given only 10k training samples, our FID on LSUN Cat matches the StyleGAN2
trained on 1.6M images. On the full dataset, our method improves FID by 1.5x to
2x on cat, church, and horse categories of LSUN.
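
The selection step described above, probing how linearly separable real and fake samples are in each pretrained model's embedding space, might look roughly like the sketch below. It is an illustration under assumptions rather than the authors' code: the probe is a plain logistic regression trained by gradient descent on a random half of the data, and the names (`linear_probe_accuracy`, `pick_next_model`), the candidate labels, and the synthetic embeddings are all hypothetical.

```python
import numpy as np

def linear_probe_accuracy(real, fake, n_steps=200, lr=0.5, seed=0):
    """Held-out accuracy of a linear real-vs-fake classifier on embeddings."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    idx = rng.permutation(len(x))
    train, val = idx[: len(x) // 2], idx[len(x) // 2:]
    # Standardize with training statistics only.
    mean, std = x[train].mean(axis=0), x[train].std(axis=0) + 1e-8
    x = (x - mean) / std
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(x[train] @ w + b)))  # sigmoid
        grad = p - y[train]
        w -= lr * x[train].T @ grad / len(train)
        b -= lr * grad.mean()
    pred = (x[val] @ w + b) > 0
    return float((pred == y[val]).mean())

def pick_next_model(embeddings, already_selected=()):
    """Return the not-yet-selected model whose embeddings separate best."""
    scores = {
        name: linear_probe_accuracy(real, fake)
        for name, (real, fake) in embeddings.items()
        if name not in already_selected
    }
    return max(scores, key=scores.get), scores

# Synthetic stand-ins for (real, fake) embeddings from two candidate models.
rng = np.random.default_rng(1)
embeddings = {
    "model_a": (rng.normal(2.0, 1.0, size=(200, 8)), rng.normal(0.0, 1.0, size=(200, 8))),
    "model_b": (rng.normal(0.0, 1.0, size=(200, 8)), rng.normal(0.0, 1.0, size=(200, 8))),
}
best, scores = pick_next_model(embeddings)
```

To mimic the progressive scheme, one would call `pick_next_model` again later in training with the previously chosen names passed as `already_selected`, recomputing the embeddings from the current generator's samples.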

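Probing the linear separability of real and fake samples in pretrained computer vision model embeddings to select the most accurate models, and progressively adding them to the discriminator ensemble, significantly improves GAN training in both limited-data and large-scale settings.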