The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models, which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to flexibly project combinations of modalities into a shared representational space. An autoregressive decoder then generates language tokens from these representations. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
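The idea of projecting any available combination of modalities into one shared space can be illustrated with a toy sketch. This is not the authors' fusion encoder, which is a learned module whose details are not given here. It is a simplified stand-in, assuming one linear projection per modality followed by concatenation along the sequence axis. The names `SHARED_DIM`, `ENCODER_DIMS`, `PROJECTIONS`, and `fuse`, as well as all dimensions, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

SHARED_DIM = 16  # hypothetical width of the shared representation space

# Hypothetical output widths of the single-modality encoders.
ENCODER_DIMS = {"vision": 32, "language": 24, "speech": 20}

# One linear projection per modality; random weights stand in for learned ones.
PROJECTIONS = {
    name: rng.normal(size=(dim, SHARED_DIM)) / np.sqrt(dim)
    for name, dim in ENCODER_DIMS.items()
}


def fuse(features):
    """Project whichever modalities are present into the shared space.

    features maps a modality name to an array of shape (seq_len, encoder_dim).
    Returns an array of shape (total_seq_len, SHARED_DIM) that a decoder
    could attend over. Simplified stand-in, not the paper's fusion encoder.
    """
    if not features:
        raise ValueError("at least one modality is required")
    unknown = set(features) - set(ENCODER_DIMS)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    # Iterate in a fixed modality order so the output layout is deterministic.
    projected = [
        features[name] @ PROJECTIONS[name]
        for name in ENCODER_DIMS
        if name in features
    ]
    return np.concatenate(projected, axis=0)


# Usage: a vision + speech input with no text still yields a valid sequence.
vision_feats = rng.normal(size=(5, 32))
speech_feats = rng.normal(size=(7, 20))
fused = fuse({"vision": vision_feats, "speech": speech_feats})
print(fused.shape)
```

Because each modality is mapped to the same width before concatenation, the same downstream decoder can consume one, two, or three modalities without architectural changes, which is the property the abstract emphasizes.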

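The paper proposes i-Code V2, the first model capable of generating natural language from any combination of vision, language, and speech data. It leverages state-of-the-art single-modality encoders to combine modalities and map them into a shared representation space, then uses an autoregressive decoder to generate language tokens from these representations. i-Code V2 is pretrained end-to-end on a large collection of datasets with a text completion objective that generalizes to arbitrary combinations of modalities, demonstrating the strength of multimodal pretraining across a variety of tasks and signals.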

i-Code V2: An Autoregressive Generation Framework for Vision, Language, and Speech Data