We present Unified-IO 2, the first autoregressive multimodal model that is
capable of understanding and generating images, text, audio, and actions. To
unify different modalities, we tokenize inputs and outputs (images, text,
audio, actions, bounding boxes, etc.) into a shared semantic space and then
process them with a single encoder-decoder transformer model. Since training
with such diverse modalities is challenging, we propose various architectural
improvements to stabilize model training. We train our model from scratch on a
large multimodal pre-training corpus from diverse sources with a multimodal
mixture-of-denoisers objective. To learn an expansive set of skills, such as
following multimodal instructions, we construct and finetune on an ensemble of
120 datasets with prompts and augmentations. With a single unified model,
Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and
strong results on more than 35 benchmarks, including image generation and
understanding, natural language understanding, video and audio understanding,
and robotic manipulation. We release all our models to the research community.
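
To make the idea of tokenizing a structured output into a shared space concrete, the following is a minimal sketch of one common approach: quantizing bounding-box coordinates into discrete bins and mapping each bin to a token id placed after the text vocabulary. It is an illustration under stated assumptions, not the model's actual tokenizer. The names `TEXT_VOCAB_SIZE`, `NUM_BINS`, `box_to_tokens`, and `tokens_to_box`, as well as the specific sizes, are hypothetical.

```python
from typing import List, Tuple

# Hypothetical sizes chosen for illustration only.
TEXT_VOCAB_SIZE = 32000  # ids below this are assumed to be ordinary text tokens
NUM_BINS = 1000          # number of discrete bins per normalized coordinate

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_to_tokens(box: Box, width: float, height: float) -> List[int]:
    """Map a pixel-space box to four token ids in the shared vocabulary."""
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    tokens = []
    for value in normalized:
        clipped = min(max(value, 0.0), 1.0)  # keep coordinates inside the image
        bin_index = min(int(clipped * NUM_BINS), NUM_BINS - 1)
        tokens.append(TEXT_VOCAB_SIZE + bin_index)  # offset past the text ids
    return tokens


def tokens_to_box(tokens: List[int], width: float, height: float) -> Box:
    """Invert box_to_tokens up to quantization error, using bin centers."""
    centers = [((t - TEXT_VOCAB_SIZE) + 0.5) / NUM_BINS for t in tokens]
    return (centers[0] * width, centers[1] * height,
            centers[2] * width, centers[3] * height)


if __name__ == "__main__":
    ids = box_to_tokens((100.0, 50.0, 300.0, 200.0), width=640.0, height=480.0)
    print(ids)
    print(tokens_to_box(ids, width=640.0, height=480.0))
```

Because the box becomes a short sequence of ordinary token ids, a single sequence model can read or emit it alongside text tokens; the cost is a quantization error of at most one bin width per coordinate.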
