Image captioning aims to automatically generate a natural language
description of a given image, and most state-of-the-art models have adopted an
encoder-decoder framework. The framework consists of a convolutional neural
network (CNN)-based image encoder that extracts region-based visual features
from the input image, and a recurrent neural network (RNN)-based caption
decoder that generates the output caption words based on the visual features
with the attention mechanism. Despite the success of existing studies, current
methods only model the co-attention that characterizes the inter-modal
interactions while neglecting the self-attention that characterizes the
intra-modal interactions. Inspired by the success of the Transformer model in
machine translation, here we extend it to a Multimodal Transformer (MT) model
for image captioning. Compared to existing image captioning approaches, the MT
model simultaneously captures intra- and inter-modal interactions in a unified
attention block. Because such attention blocks are composed modularly in depth,
the MT model can perform complex multimodal reasoning and output
accurate captions. Moreover, to further improve the image captioning
performance, multi-view visual features are seamlessly introduced into the MT
model. We quantitatively and qualitatively evaluate our approach using the
benchmark MSCOCO image captioning dataset and conduct extensive ablation
studies to investigate the reasons behind its effectiveness. The experimental
results show that our method significantly outperforms the previous
state-of-the-art methods. With an ensemble of seven models, our solution ranks
first on the real-time leaderboard of the MSCOCO image captioning challenge at
the time of writing.
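
To make the idea of a unified attention block concrete, the following is a minimal sketch, not the MT model itself. It assumes plain scaled dot-product attention with residual connections and deliberately omits learned projections, multiple heads, layer normalization, the feed-forward sublayer, and the causal mask a caption decoder would need. The names `attention` and `unified_block` are hypothetical and chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query is scored against every key, the scores are normalized
    with softmax, and the output is the weighted sum of the values.
    """
    d = len(keys[0])
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim_out)])
    return out

def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors (residual)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def unified_block(words, regions):
    """Illustrative block: self-attention, then attention over the image.

    Assumption: `words` and `regions` share the same feature dimension.
    """
    # Intra-modal interaction: each word attends to all words.
    h = add(words, attention(words, words, words))
    # Inter-modal interaction: each word attends to all image regions.
    return add(h, attention(h, regions, regions))

# Toy inputs: 2 word vectors and 3 region vectors, all of dimension 2.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = unified_block(words, regions)
```

In this sketch, stacking several calls to `unified_block` would correspond loosely to the in-depth composition of attention blocks described above; the output keeps one fused vector per word.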

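This approach performs image captioning with a Multimodal Transformer model combined with multi-view visual features. It simultaneously captures relationships within the image and between the image and the text, significantly improves on previous methods in the field, and is the latest result on the image captioning task.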