Image captioning is a challenging task that requires both understanding the
visual content of an image and describing that content in natural language. In
this paper, we propose an efficient way to improve the image understanding
ability of transformer-based methods by extending the Object Relation
Transformer architecture with the Attention on Attention mechanism. Experiments
on the VieCap4H dataset show that our proposed method significantly outperforms
the original architecture on both the public test and the private test of the
Image Captioning shared task held by VLSP.

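This study proposes an efficient way to improve transformer-based image
understanding by extending the Object Relation Transformer architecture with
the Attention on Attention mechanism. Experiments show that the method
significantly outperforms the original architecture on both the public test and
the private test of the Image Captioning shared task held by VLSP.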
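
The abstract names the Attention on Attention (AoA) mechanism without spelling it out. The following is a minimal single-query sketch, assuming the commonly described formulation in which an "information" vector and a sigmoid "gate" are both computed from the query concatenated with the ordinary attention result, then multiplied elementwise. All function and parameter names (`attend`, `attention_on_attention`, `W_info`, `W_gate`, and so on) are illustrative assumptions, not the authors' implementation, and multi-head projections are omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear(W, b, x):
    # W is a list of rows; returns W @ x + b.
    return [dot(row, x) + bias for row, bias in zip(W, b)]

def attend(q, keys, values):
    # Ordinary scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def attention_on_attention(q, keys, values, W_info, b_info, W_gate, b_gate):
    # Step 1: the attended vector from ordinary attention.
    v_hat = attend(q, keys, values)
    # Step 2: concatenate the query with the attended vector.
    x = q + v_hat  # list concatenation, not elementwise addition
    # Step 3: information vector and sigmoid gate, both linear in x.
    info = linear(W_info, b_info, x)
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(W_gate, b_gate, x)]
    # Step 4: the gate decides how much of each information channel passes.
    return [g * i for g, i in zip(gate, info)]
```

With a query of dimension d, `W_info` and `W_gate` each have d rows of length 2d in this sketch. The gate lets the model suppress an attended vector that is poorly related to the query, which is the motivation usually given for adding AoA on top of a standard attention layer.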

VieCap4H-VLSP 2021: Improving the performance of an attention-based Object Relation Transformer for Vietnamese image captioning

VieCap4H-VLSP 2021: ObjectAoA-Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning

Image captioning models typically follow an encoder-decoder architecture
that takes abstract image feature vectors as input to the encoder. One of the
most successful approaches uses feature vectors extracted from the region
proposals obtained from an object detector. In this work, we introduce the
Object Relation Transformer, which builds upon this approach by explicitly
incorporating information about the spatial relationships between detected
input objects through geometric attention. Quantitative and qualitative results
demonstrate the importance of such geometric attention for image captioning,
leading to improvements on all common captioning metrics on the MS-COCO
dataset.

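This paper introduces the Object Relation Transformer, an image captioning
model that explicitly incorporates the spatial relationships between detected
input objects into an encoder-decoder architecture, modeling them through
geometric attention. The results show that this geometric attention is
important for image captioning and improves all standard evaluation metrics on
the MS-COCO dataset.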
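
To make "geometric attention" more concrete, here is a simplified sketch of one way to fold box geometry into attention weights, assuming a formulation in which a non-negative geometric weight computed from the relative position and size of two boxes rescales the exponentiated appearance score before normalization. The names `displacement`, `geometric_weight`, `combine`, `w_g`, `b_g`, and the clamp `EPS` are hypothetical. The sketch also projects the displacement vector directly with a linear map, so any higher-dimensional embedding of the displacement used in the actual model is left out.

```python
import math

EPS = 1e-3  # assumed clamp so log() stays finite when box centres coincide

def displacement(box_m, box_n):
    # Each box is (centre_x, centre_y, width, height).
    # Returns a 4-d scale-invariant description of box_n relative to box_m.
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return [
        math.log(max(abs(xm - xn), EPS) / wm),
        math.log(max(abs(ym - yn), EPS) / hm),
        math.log(wn / wm),
        math.log(hn / hm),
    ]

def geometric_weight(box_m, box_n, w_g, b_g):
    # ReLU of a linear projection: pairs with non-positive scores get weight 0.
    lam = displacement(box_m, box_n)
    return max(0.0, sum(w * l for w, l in zip(w_g, lam)) + b_g)

def combine(appearance_scores, geometric_weights):
    # Attention weights of one query object over all objects:
    # geometric weight times exp(appearance score), normalized to sum to 1.
    # Assumes at least one geometric weight is positive.
    m = max(appearance_scores)  # subtract the max for numerical stability
    unnorm = [g * math.exp(a - m)
              for a, g in zip(appearance_scores, geometric_weights)]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

In this sketch an object whose geometric weight is zero receives no attention regardless of how similar its appearance features are, which illustrates how spatial relationships can override purely appearance-based scores.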