While Large Language Models (LLMs) are the dominant models for generative
tasks in language, they do not perform as well as diffusion models on image and
video generation. To effectively use LLMs for visual generation, one crucial
component is the visual tokenizer that maps pixel-space inputs to discrete
tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a
video tokenizer designed to generate concise and expressive tokens for both
videos and images using a common token vocabulary. Equipped with this new
tokenizer, we show that LLMs outperform diffusion models on standard image and
video generation benchmarks including ImageNet and Kinetics. In addition, we
demonstrate that our tokenizer surpasses the previously top-performing video
tokenizer on two more tasks: (1) video compression comparable to the
next-generation video codec (VVC) according to human evaluations, and (2)
learning effective representations for action recognition tasks.
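
To make "maps pixel-space inputs to discrete tokens" concrete, here is a minimal sketch of a generic nearest-codebook quantizer. It is an illustration only and should not be read as MAGVIT-v2's tokenizer, whose design the abstract does not describe. The codebook, the feature vectors, and the function names `tokenize` and `detokenize` are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def detokenize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token id (lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 2-D codebook with four entries; a real tokenizer would learn
# a much larger vocabulary over encoder features rather than raw values.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = tokenize(features, codebook)
reconstruction = detokenize(tokens, codebook)
```

The resulting integer ids are what a language model would consume as its visual vocabulary; the same codebook serves both images and video frames in this toy setup, mirroring the shared vocabulary the abstract mentions.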

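By introducing MAGVIT-v2 as the visual tokenizer, this paper shows that large language models (LLMs) outperform diffusion models on image and video generation, and that the tokenizer surpasses the previously best-performing video tokenizer on video compression and action recognition.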

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Recently, the remarkable advance of the Large Language Model (LLM) has
inspired researchers to transfer its extraordinary reasoning capability to data
across several modalities. The prevailing approaches primarily regard visual
input as the prompt and focus exclusively on optimizing the text generation
process conditioned upon vision content by a frozen LLM. Such an inequitable
treatment of vision and language heavily constrains the model's potential. In
this paper, we break through this limitation by representing both vision and
language in a unified representation. To this end, we craft a visual tokenizer
that translates the non-linguistic image into a sequence of discrete tokens
like a foreign language that an LLM can read. The resulting visual tokens
encompass high-level semantics worthy of a word and also support dynamic
sequence length that varies with the image content. Coupled with this visual
tokenizer, the presented foundation model called LaVIT (Language-VIsion
Transformer) can handle both image and text indiscriminately under a unified
generative learning paradigm. Pre-trained on the web-scale image-text corpus,
LaVIT is empowered with impressive multi-modal comprehension capability. The
extensive experiments showcase that it outperforms existing models by a large
margin on downstream tasks. Our code and models will be available at
this https URL

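Recent advances in large language models have inspired researchers to transfer their reasoning capability to data in multiple modalities. By representing vision and language in a unified form, this paper overcomes the limitation of treating visual content only as a prompt and optimizing only the text generation process. With a visual tokenizer that converts non-linguistic images into a sequence of discrete tokens an LLM can read, LaVIT (Language-VIsion Transformer) handles images and text indiscriminately under a unified generative learning paradigm. Pre-trained on a web-scale image-text corpus, LaVIT shows impressive multi-modal comprehension capability. Extensive experiments show that it outperforms existing models by a large margin on downstream tasks. Our code and models will be available at this https URL.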

Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization

The success of language Transformers is primarily attributed to the pretext
task of masked language modeling (MLM), where texts are first tokenized into
semantically meaningful pieces. In this work, we study masked image modeling
(MIM) and indicate the advantages and challenges of using a semantically
meaningful visual tokenizer. We present a self-supervised framework iBOT that
can perform masked prediction with an online tokenizer. Specifically, we
perform self-distillation on masked patch tokens and take the teacher network
as the online tokenizer, along with self-distillation on the class token to
acquire visual semantics. The online tokenizer is jointly learnable with the
MIM objective and dispenses with a multi-stage training pipeline where the
tokenizer needs to be pre-trained beforehand. We demonstrate the effectiveness of iBOT by
achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy
evaluated on ImageNet-1K. Beyond the state-of-the-art image classification
results, we highlight emerging local semantic patterns, which help the models
to obtain strong robustness against common corruptions and achieve leading
results on dense downstream tasks, e.g., object detection, instance
segmentation, and semantic segmentation.
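
The self-distillation on masked patch tokens described above can be sketched roughly as follows. This is a simplified illustration, not iBOT's implementation: the teacher's per-patch output distribution acts as the "online tokenizer" target, and the student, which sees the masked input, is trained to match it at the masked positions only. The function names, the temperature values, and the toy logits are assumptions made for this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable log-softmax; avoids taking log(0) on underflow."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_norm for x in scaled]


def masked_distillation_loss(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
    mask: Sequence[bool],
    student_temp: float = 0.1,   # assumed value, for illustration
    teacher_temp: float = 0.04,  # assumed value, for illustration
) -> float:
    """Mean cross-entropy between teacher and student over masked patches."""
    losses = []
    for s, t, is_masked in zip(student_logits, teacher_logits, mask):
        if not is_masked:
            continue  # only masked patches contribute to this loss
        target = softmax(t, teacher_temp)      # teacher output as soft token
        log_pred = log_softmax(s, student_temp)
        losses.append(-sum(p * lp for p, lp in zip(target, log_pred)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: two patches, three output dimensions, first patch masked.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student_agree = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student_disagree = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]
mask = [True, False]

loss_agree = masked_distillation_loss(student_agree, teacher, mask)
loss_disagree = masked_distillation_loss(student_disagree, teacher, mask)
```

In a full training loop the teacher's targets would be computed without gradient, and a second loss of the same form would be applied to the class token; both details are omitted here for brevity.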

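This work studies masked image modeling and identifies the advantages and challenges of using a semantically meaningful visual tokenizer. It proposes iBOT, a self-supervised framework that performs masked prediction with an online tokenizer. iBOT achieves strong image classification results and leading results on dense downstream tasks.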