Multimodal Generation: Combining Language Models with Images

Jan, 2023

Grounding Language Models to Images for Multimodal Generation

Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

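TL;DR: This work proposes an efficient method for adapting pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image and text data, and achieves strong results on tasks such as contextual image retrieval and multimodal dialogue.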

Abstract

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image-and-text data. Our method leverages the abil