Tremendous progress has been made in recent years in developing better image
captioning models, yet most of them rely on a separate object detector to
extract regional features. Recent vision-language studies are shifting towards
detector-free designs that leverage grid representations for more flexible
model training and faster inference. However, such development is
primarily focused on image understanding tasks, and remains less investigated
for the caption generation task. In this paper, we aim to build a
better-performing detector-free image captioning model and propose a pure
vision transformer-based image captioning model, dubbed ViTCAP, in which
grid representations are used without extracting regional features. For
improved performance, we introduce a novel Concept Token Network (CTN) to
predict the semantic concepts and then incorporate them into the end-to-end
captioning. In particular, the CTN is built on a vision transformer and is
designed to predict the concept tokens through a classification task; the
rich semantic information these tokens carry greatly benefits the
captioning task. Compared with previous detector-based models,
ViTCAP drastically simplifies the architecture while achieving
competitive performance on various challenging image captioning datasets. In
particular, ViTCAP reaches a CIDEr score of 138.1 on the COCO-caption
Karpathy split, and CIDEr scores of 93.8 and 108.6 on the nocaps and
Google-CC captioning datasets, respectively.
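
To make the role of the CTN concrete, the following is a minimal sketch of the idea under stated assumptions, not the ViTCAP implementation. It treats concept prediction as multi-label classification over a concept vocabulary, keeps the top-k concepts, and prepends their embeddings to the grid features before decoding. The mean pooling, the single linear classifier, the concatenation scheme, and all names (`predict_concepts`, `build_decoder_input`, `top_k`) are hypothetical choices made for illustration only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_concepts(grid_feats, weights, bias, top_k):
    """Score every concept in the vocabulary and return the top-k concept ids.

    Assumption: grid features are mean-pooled and fed to one linear layer
    with independent sigmoid outputs (multi-label classification).
    """
    dim = len(grid_feats[0])
    pooled = [sum(f[d] for f in grid_feats) / len(grid_feats) for d in range(dim)]
    probs = [
        sigmoid(sum(w * x for w, x in zip(row, pooled)) + b)
        for row, b in zip(weights, bias)
    ]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k], probs


def build_decoder_input(grid_feats, concept_ids, concept_embeddings):
    """Prepend the selected concept token embeddings to the grid features.

    Assumption: concepts are injected by simple sequence concatenation.
    """
    return [concept_embeddings[i] for i in concept_ids] + list(grid_feats)


# Toy example with random values (hypothetical sizes, no trained weights).
random.seed(0)
num_patches, dim, vocab_size, top_k = 4, 3, 5, 2


def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


grid_feats = [rand_vec(dim) for _ in range(num_patches)]
weights = [rand_vec(dim) for _ in range(vocab_size)]
bias = rand_vec(vocab_size)
concept_embeddings = [rand_vec(dim) for _ in range(vocab_size)]

concept_ids, probs = predict_concepts(grid_feats, weights, bias, top_k)
decoder_input = build_decoder_input(grid_feats, concept_ids, concept_embeddings)
print(len(decoder_input))  # top_k concept tokens + num_patches grid tokens
```

In this sketch the decoder sees one sequence containing both concept tokens and grid tokens, so no separate object detector is needed to supply semantic cues.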

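This paper proposes ViTCAP, a new vision transformer-based image captioning model that uses grid representations without extracting regional features. To improve performance, it introduces a Concept Token Network (CTN), built on a Vision Transformer, that efficiently predicts semantic concepts and integrates them into end-to-end caption generation, achieving competitive performance while simplifying the architecture.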