We introduce a novel sequential modeling approach which enables learning a
Large Vision Model (LVM) without making use of any linguistic data. To do this,
we define a common format, "visual sentences", in which we can represent raw
images and videos as well as annotated data sources such as semantic
segmentations and depth reconstructions without needing any meta-knowledge
beyond the pixels. Once this wide variety of visual data (comprising 420
billion tokens) is represented as sequences, the model can be trained to
minimize a cross-entropy loss for next token prediction. By training across
various scales of model architecture and data diversity, we provide empirical
evidence that our models scale effectively. Many different vision tasks can be
solved by designing suitable visual prompts at test time.

我们引入了一种新颖的顺序建模方法，可以学习大规模视觉模型（LVM）而无需使用任何语言数据。通过将原始图像、视频以及注解数据源转化为 “视觉句子” 的公共格式，我们可以表示各种视觉数据，并通过训练模型来解决多个视觉任务。