We present an approach that poses object recognition as next-token prediction.
The idea is to apply a language decoder that auto-regressively predicts text
tokens from image embeddings to form labels. To ground this prediction
process in auto-regression, we customize a non-causal attention mask for the
decoder with two key features: tokens from different labels are modeled as
independent, and image tokens are treated as a prefix. This masking
mechanism inspires an efficient method, one-shot sampling, that samples the
tokens of multiple labels in parallel and ranks the generated labels by their
probabilities during inference. To further improve efficiency, we propose a
simple strategy for constructing a compact decoder: discard the
intermediate blocks of a pretrained language model. This approach yields a
decoder that matches the full model's performance while being notably more
efficient. The code is available at this https URL
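
As a rough illustration of the masking rule described in the abstract, the sketch below builds a boolean attention mask for a sequence laid out as image tokens followed by the tokens of several labels. The function name build_mask, its parameters, and the sequence layout are hypothetical, and the choice to let image tokens attend to each other bidirectionally is an assumption; the abstract only states that image tokens act as a prefix and that tokens of different labels are independent.

```python
def build_mask(num_image_tokens, label_lengths):
    """Return a square mask; mask[q][k] is True if query q may attend to key k.

    Assumed layout: [image tokens][label 1 tokens][label 2 tokens]...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Assumption: image tokens form a prefix and see each other bidirectionally.
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    start = num_image_tokens
    for length in label_lengths:
        for q in range(start, start + length):
            # Every text token sees the whole image prefix.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Causal attention inside the same label only, so tokens of
            # different labels never attend to each other.
            for k in range(start, q + 1):
                mask[q][k] = True
        start += length
    return mask


if __name__ == "__main__":
    # Two image tokens, then a two-token label and a one-token label.
    for row in build_mask(2, [2, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no label's tokens attend to another label's tokens, the prediction for each label depends only on the image prefix and on that label's own earlier tokens, which is the property that makes decoding several labels in parallel possible.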

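We propose a method that treats object recognition as next-token prediction, in which a language decoder auto-regressively predicts text tokens from image embeddings. We ground the prediction process in auto-regression with a customized non-causal attention mask that models tokens from different labels as independent and treats image tokens as a prefix. We propose an efficient one-shot sampling method that samples the tokens of multiple labels in parallel and ranks the generated labels by their probabilities during inference. To further improve efficiency, we propose a simple strategy that builds a compact decoder by discarding the intermediate blocks of a pretrained language model. The resulting decoder preserves the full model's performance while being notably more efficient.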
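
The compact-decoder strategy can be pictured as a simple list operation on the decoder's stack of blocks. The sketch below is a minimal illustration, not the authors' implementation: truncate_blocks, keep_first, and keep_last are hypothetical names, and how many leading or trailing blocks to keep is an assumption, since the abstract says only that intermediate blocks are discarded.

```python
def truncate_blocks(blocks, keep_first, keep_last):
    """Drop the intermediate blocks, keeping the first and last few.

    `blocks` is any ordered sequence of decoder blocks (plain strings
    stand in for real modules here).
    """
    if keep_first < 0 or keep_last < 0:
        raise ValueError("keep_first and keep_last must be non-negative")
    # Nothing to discard if the kept ranges already cover the whole stack.
    if keep_first + keep_last >= len(blocks):
        return list(blocks)
    head = list(blocks[:keep_first])
    tail = list(blocks[len(blocks) - keep_last:]) if keep_last else []
    return head + tail


if __name__ == "__main__":
    stack = [f"block_{i}" for i in range(8)]
    print(truncate_blocks(stack, keep_first=2, keep_last=1))
```

With a real pretrained model, the same selection would be applied to its list of transformer blocks before the kept blocks are used as the decoder.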

Object Recognition as Next Token Prediction

Image captioning is a fundamental task that joins vision and language,
involving both cross-modal understanding and text generation. Recent years
have witnessed growing attention to image captioning. Most existing works
follow a traditional two-stage training paradigm: before the captioning
model is trained, an extra object detector is first used to recognize the
objects in the image. However, training the object detector requires sizeable
datasets with fine-grained object annotations, which are daunting to
collect. In addition, errors of the object detector easily propagate to
the downstream captioning model, degrading its performance.
To alleviate such defects, we propose a frustratingly simple but highly
effective end-to-end image captioning framework, Visual Conditioned GPT
(VC-GPT), by connecting a pre-trained visual encoder (CLIP-ViT) and a language
decoder (GPT2). Unlike the vanilla connection method, which directly
inserts cross-attention modules into GPT2, we introduce a self-ensemble
cross-modal fusion mechanism that comprehensively considers both single-
and cross-modal knowledge. As a result, we do not need extra object detectors
for model training. Experimental results on three popular image
captioning benchmarks (MSCOCO, Flickr30k and NoCaps) demonstrate that our
VC-GPT achieves either the best or the second-best performance across all
evaluation metrics over extensive baseline systems.

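By connecting a pre-trained visual encoder and a pre-trained language decoder, the paper proposes a self-ensemble cross-modal fusion mechanism and builds a simple but effective end-to-end image captioning framework named VC-GPT. The framework needs no extra object detector, which avoids the annotation cost and error propagation of two-stage methods, and the experiments show that VC-GPT achieves the best or second-best performance against the baseline systems.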