Recent large-scale vision-language pre-training (VLP) of dual-stream
architectures (e.g., CLIP) on a tremendous amount of image-text pair data
has shown its superiority on various multimodal alignment tasks. Despite its
success, the resulting models are not capable of multimodal generative tasks
due to the weak text encoder. To tackle this problem, we propose to augment the
dual-stream VLP model with a textual pre-trained language model (PLM) via
vision-language knowledge distillation (VLKD), enabling multimodal
generation. VLKD is data- and computation-efficient compared to
pre-training from scratch. Experimental results show that the resulting
model has strong zero-shot performance on multimodal generation tasks, such as
open-ended visual question answering and image captioning. For example, it
achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous
state-of-the-art zero-shot model while using $7\times$ fewer parameters. Furthermore,
the original textual language understanding and generation ability of the PLM
is maintained after VLKD, which makes our model versatile for both multimodal
and unimodal tasks.

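Augmenting a dual-stream VLP model through vision-language knowledge distillation (VLKD) gives it multimodal generation capability, yielding strong zero-shot performance on multimodal generation tasks such as open-ended visual question answering and image captioning.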

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

One method for creating a vision-and-language (V&L) model is to extend a
language model through structural modifications and V&L pre-training. Such an
extension aims to make a V&L model inherit the capability of natural language
understanding (NLU) from the original language model. To see how well this is
achieved, we propose to evaluate V&L models using an NLU benchmark (GLUE). We
compare five V&L models, including single-stream and dual-stream models,
trained with the same pre-training. Dual-stream models, with their higher
modality independence achieved by approximately doubling the number of
parameters, are expected to preserve the NLU capability better. Our main
finding is that the dual-stream scores are not much different from the
single-stream scores, contrary to expectation. Further analysis shows that
pre-training causes a performance drop in NLU tasks, with few exceptions.
These results suggest that adopting a single-stream structure and carefully
designing the pre-training could be an effective method for better preserving
language knowledge in V&L extensions.

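This study evaluates vision-and-language models, created by extending a language model through structural modifications and pre-training, on the GLUE benchmark and compares single-stream and dual-stream models. The results show that dual-stream scores are not much different from single-stream scores, suggesting that a single-stream structure combined with carefully designed pre-training could be effective for preserving language knowledge.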