Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.

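This paper explores how the high-level skills needed in computer vision can be learned from text data and transferred to vision tasks. It also examines the systematic differences between modalities in the embedding spaces of contrastive models, and studies strategies for understanding and mitigating them. In practice, the models we build using only text training data on four representative tasks (image captioning, visual entailment, visual question answering, and visual news) perform close to models built using only image training data; for image captioning and visual entailment in particular, text training data is expected to yield gains of more than 9 percentage points. We also showcase a variety of stylistic image captioning models trained not on image data or human-curated language data, but on text data available from books, the web, or language models.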
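The core idea, training on text embeddings and then swapping in image embeddings at test time, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the embeddings are synthetic stand-ins rather than outputs of a real contrastive model, the "modality gap" is simulated as a constant offset, the task is a toy binary classification, and adding Gaussian noise to the text embeddings during training is just one plausible mitigation strategy, not necessarily the one the paper settles on.

```python
import numpy as np

def normalize(x):
    """Project vectors onto the unit sphere, as contrastive encoders typically do."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def make_paired_embeddings(n, dim, gap_scale, rng):
    """Synthetic (hypothetical) text/image embedding pairs with a constant modality offset."""
    labels = rng.integers(0, 2, size=n)
    class_dirs = normalize(rng.normal(size=(2, dim)))
    shared = class_dirs[labels] + 0.1 * rng.normal(size=(n, dim))
    gap = gap_scale * normalize(rng.normal(size=dim))  # simulated systematic difference
    text = normalize(shared)
    image = normalize(shared + gap)
    return text, image, labels

def modality_gap(text, image):
    """Distance between the centroids of the two modalities."""
    return float(np.linalg.norm(image.mean(axis=0) - text.mean(axis=0)))

def fit_linear_probe(x, y, noise_std, rng, ridge=1e-3):
    """Ridge-regression classifier trained on noise-augmented embeddings."""
    x_noisy = x + noise_std * rng.normal(size=x.shape)
    design = np.hstack([x_noisy, np.ones((len(x), 1))])  # append bias column
    targets = 2.0 * y - 1.0                              # map {0, 1} to {-1, +1}
    gram = design.T @ design + ridge * np.eye(design.shape[1])
    return np.linalg.solve(gram, design.T @ targets)

def predict(weights, x):
    design = np.hstack([x, np.ones((len(x), 1))])
    return (design @ weights > 0).astype(int)

rng = np.random.default_rng(0)
text, image, labels = make_paired_embeddings(n=400, dim=32, gap_scale=0.5, rng=rng)

# Train on TEXT embeddings only (first 300 pairs) ...
weights = fit_linear_probe(text[:300], labels[:300], noise_std=0.05, rng=rng)

# ... then evaluate on held-out IMAGE embeddings (last 100 pairs).
gap = modality_gap(text, image)
image_accuracy = float((predict(weights, image[300:]) == labels[300:]).mean())
print(f"modality gap: {gap:.3f}, accuracy on image embeddings: {image_accuracy:.3f}")
```

In this toy setting the classifier never sees an image embedding during training, yet it can still be applied to them because both modalities share one space. With a real contrastive model the gap is likely to be less well behaved than a constant offset, which is why the mitigation strategy matters.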

Learning visual tasks using only language data, with no images at all!