Text-based image captioning (TextCap) requires simultaneously comprehending
visual content and reading the text in images to generate a natural language
description. Although this task can teach machines to better understand the
complex human environment, given that text is omnipresent in our daily
surroundings, it poses challenges beyond those of conventional captioning. A
text-based image intuitively contains abundant and complex multimodal
relational content; that is, image details can be described diversely from
multiple views rather than by a single caption. Although additional paired
training data could be introduced to show the diversity of image descriptions,
annotating TextCap pairs with extra texts is labor-intensive and
time-consuming. Based on the
insight mentioned above, we investigate how to generate diverse captions that
focus on different image parts using an unpaired training paradigm. We propose
the Multimodal relAtional Graph adversarIal inferenCe (MAGIC) framework for
diverse and unpaired TextCap. This framework can adaptively construct multiple
multimodal relational graphs of images and model complex relationships among
graphs to represent descriptive diversity. Moreover, a cascaded generative
adversarial network is developed on the modeled graphs to perform unpaired
caption generation at the image-sentence feature alignment and linguistic
coherence levels. We validate the effectiveness of MAGIC in generating diverse
captions from different relational information items of an image. Experimental
results show that MAGIC can generate promising captions without using any
image-caption training pairs.

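This work studies how to generate diverse text-based image captions using an unpaired training paradigm. It proposes the Multimodal relAtional Graph adversarIal inferenCe (MAGIC) framework and uses a cascaded generative adversarial network to infer diverse, related captions from the multimodal graphs.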

MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning

Text-based image captioning (TextCap), which aims to read and reason about
images with text, is crucial for a machine to understand a detailed and complex
scene environment, considering that text is omnipresent in daily life. This
task, however, is very challenging because an image often contains complex text
and visual information that is hard to describe comprehensively. Existing
methods attempt to solve this task by extending traditional image captioning
methods, which focus on describing the overall scene of an image with one
global caption. This is infeasible because the complex text and visual
information cannot be described well within one caption. To resolve this
difficulty, we seek to generate multiple captions that accurately describe
different parts of an image in detail. Achieving this involves three key
challenges: 1) it is hard to decide which parts of the text in an image to copy
or paraphrase; 2) it is non-trivial to capture the complex relationships among
the diverse texts in an image; 3) generating multiple captions with diverse
content is still an open problem. To address these challenges, we propose a
novel Anchor-Captioner method. Specifically, we first find the important tokens
that deserve more attention and treat them as anchors. Then, for each chosen anchor,
we group its relevant texts to construct the corresponding anchor-centred graph
(ACG). Last, based on different ACGs, we conduct multi-view caption generation
to improve the content diversity of generated captions. Experimental results
show that our method not only achieves SOTA performance but also generates
diverse captions to describe images.
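
The anchor-then-group idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how anchors are scored or how relevant texts are grouped, so the sketch assumes a precomputed importance score per OCR token and swaps in a plain distance threshold for the grouping step. The names `OcrToken`, `select_anchors`, `build_acg`, and the `radius` parameter are hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class OcrToken:
    text: str
    center: tuple[float, float]  # normalised (x, y) centre of the token box
    score: float                 # assumed importance score (hypothetical input)


def select_anchors(tokens, k):
    """Return the k highest-scoring tokens; these play the role of anchors."""
    return sorted(tokens, key=lambda t: t.score, reverse=True)[:k]


def build_acg(anchor, tokens, radius):
    """Group the tokens within `radius` of the anchor, anchor listed first.

    A distance threshold is a stand-in assumption for the grouping step.
    """
    neighbours = [
        t for t in tokens
        if t is not anchor and dist(t.center, anchor.center) <= radius
    ]
    return [anchor] + neighbours


def anchor_centred_graphs(tokens, k=2, radius=0.2):
    """Build one token group per anchor; each group would feed one caption."""
    return [build_acg(a, tokens, radius) for a in select_anchors(tokens, k)]


tokens = [
    OcrToken("STOP", (0.20, 0.20), 0.9),
    OcrToken("AHEAD", (0.25, 0.28), 0.4),
    OcrToken("Main", (0.80, 0.70), 0.8),
    OcrToken("St", (0.88, 0.70), 0.3),
]
graphs = anchor_centred_graphs(tokens, k=2, radius=0.15)
groups = [[t.text for t in g] for g in graphs]
```

Each group in `groups` covers a different region of the image, which is what lets a downstream captioner produce one caption per view instead of a single global caption.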

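This paper proposes a multi-view caption generation method based on anchor tokens and anchor-centred graphs, to improve the diversity and accuracy of the generated captions.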

Towards Accurate Text-based Image Captioning with Content Diversity Exploration

Texts appearing in daily scenes that can be recognized by OCR (Optical
Character Recognition) tools contain significant information, such as street
names, product brands, and prices. Two tasks -- text-based visual question
answering and text-based image captioning, with a text extension from existing
vision-language applications, are catching on rapidly. To address these
problems, many sophisticated multi-modality encoding frameworks (such as
heterogeneous graph structure) are being used. In this paper, we argue that a
simple attention mechanism can do the same job or even better without any bells
and whistles. Under this mechanism, we simply split OCR token features into
separate visual- and linguistic-attention branches, and send them to a popular
Transformer decoder to generate answers or captions. Surprisingly, we find this
simple baseline model is rather strong -- it consistently outperforms
state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three
tasks of ST-VQA, although these SOTA models use far more complex encoding
mechanisms. Transferring it to text-based image captioning, we also surpass the
TextCaps Challenge 2020 winner. We hope this work sets a new baseline for
these two OCR text-related applications and inspires new thinking about
multi-modality encoder design. Code is available at
this https URL
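
The split-branch idea described above can be shown with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's model: the abstract does not specify the attention form, so dot-product attention is assumed, and the visual and linguistic OCR features are assumed to be already projected to the dimension of a guiding query vector. The function names `attend` and `split_ocr_attention` are hypothetical.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product attention: weighted sum of `features` guided by `query`."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]


def split_ocr_attention(query, ocr_visual, ocr_linguistic):
    """Attend over visual and linguistic OCR features in separate branches.

    Returns one summary vector per branch; in a full model both would be
    passed on to a Transformer decoder (not sketched here).
    """
    return attend(query, ocr_visual), attend(query, ocr_linguistic)


# Two OCR tokens, each with a visual and a linguistic feature vector.
visual_summary, linguistic_summary = split_ocr_attention(
    query=[0.0, 0.0],
    ocr_visual=[[1.0, 0.0], [3.0, 0.0]],
    ocr_linguistic=[[0.0, 2.0], [0.0, 4.0]],
)
```

With a zero query every token receives equal weight, so each summary is simply the mean of its branch's features. A non-zero query shifts the weights in each branch independently, which is the point of keeping the two feature types apart.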

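This paper proposes a simple attention mechanism that splits OCR token features into separate visual- and linguistic-attention branches and sends them to a popular Transformer decoder to generate answers or captions. It achieves state-of-the-art results on benchmarks including TextVQA and ST-VQA, and surpasses the TextCaps Challenge 2020 winner on text-based image captioning.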