We study the new problem of automatic question generation (QG) from
multi-modal sources containing images and texts, significantly expanding the
scope of most existing work, which focuses exclusively on QG from textual
sources. We propose a simple solution for this new problem, called
MultiQG-TI, which enables a text-only question generator to process visual
input in addition to textual input. Specifically, we leverage an image-to-text
model and an optical character recognition model to obtain the textual
description of the image and extract any texts in the image, respectively, and
then feed them together with the input texts to the question generator. We only
fine-tune the question generator while keeping the other components fixed. On
the challenging ScienceQA dataset, we demonstrate that MultiQG-TI significantly
outperforms ChatGPT with few-shot prompting, despite having a hundred times fewer
trainable parameters. Additional analyses empirically confirm the necessity of
both visual and textual signals for QG and show the impact of various modeling
choices.
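
The pipeline described above (caption the image, run OCR on it, and concatenate
both with the input text before calling the question generator) can be sketched
as follows. This is a minimal illustration, not the authors' implementation:
the function name build_qg_input, the field labels ("context:", "image
description:", "image text:"), and the newline separator are all assumptions,
and the captioning and OCR models are passed in as plain callables so the
sketch stays independent of any particular library.

```python
from typing import Any, Callable


def build_qg_input(
    context: str,
    image: Any,
    caption_fn: Callable[[Any], str],
    ocr_fn: Callable[[Any], str],
) -> str:
    """Merge textual and visual signals into one text-only prompt.

    caption_fn stands in for a frozen image-to-text model and ocr_fn for a
    frozen OCR model (both hypothetical callables). Only the downstream
    question generator would be fine-tuned on the resulting string.
    """
    caption = caption_fn(image).strip()
    ocr_text = ocr_fn(image).strip()

    # Field labels are illustrative assumptions, not taken from the paper.
    parts = [f"context: {context.strip()}", f"image description: {caption}"]
    if ocr_text:
        # Skip the OCR field when the image contains no readable text.
        parts.append(f"image text: {ocr_text}")
    return "\n".join(parts)


if __name__ == "__main__":
    # Stub models so the sketch runs without any ML dependency.
    prompt = build_qg_input(
        "Magnets attract or repel each other.",
        image=None,
        caption_fn=lambda img: "two bar magnets side by side",
        ocr_fn=lambda img: "N S N S",
    )
    print(prompt)
```

Because the visual information reaches the generator only as text, any
text-to-text model could in principle fill the question-generator role without
architectural changes.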

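We study the new problem of automatic question generation (QG) from multi-modal sources containing images and text, significantly expanding the scope of existing work, which focuses only on QG from textual sources. We propose a simple solution, called MultiQG-TI, which enables a text-only question generator to process visual input. We use an image-to-text model and an optical character recognition model to obtain a textual description of the image and to extract any text in the image, and then feed them together with the input text to the question generator. On the challenging ScienceQA dataset, we show that MultiQG-TI significantly outperforms ChatGPT with few-shot prompting, even though it has a hundred times fewer trainable parameters. Additional analyses empirically confirm the necessity of both visual and textual signals for QG and show the impact of various modeling choices.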

A Study of Question Generation Techniques under Multi-modal Considerations

MultiQG-TI: Towards Question Generation from Multi-modal Sources

Visually-situated language is ubiquitous -- sources range from textbooks with
diagrams to web pages with images and tables, to mobile apps with buttons and
forms. Perhaps due to this diversity, previous work has typically relied on
domain-specific recipes with limited sharing of the underlying data, model
architectures, and objectives. We present Pix2Struct, a pretrained
image-to-text model for purely visual language understanding, which can be
finetuned on tasks containing visually-situated language. Pix2Struct is
pretrained by learning to parse masked screenshots of web pages into simplified
HTML. The web, with its richness of visual elements cleanly reflected in the
HTML structure, provides a large source of pretraining data well suited to the
diversity of downstream tasks. Intuitively, this objective subsumes common
pretraining signals such as OCR, language modeling, and image captioning. In
addition to the novel pretraining strategy, we introduce a variable-resolution
input representation and a more flexible integration of language and vision
inputs, where language prompts such as questions are rendered directly on top
of the input image. For the first time, we show that a single pretrained model
can achieve state-of-the-art results in six out of nine tasks across four
domains: documents, illustrations, user interfaces, and natural images.
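
One plausible reading of a variable-resolution input representation is to
rescale each image, preserving its aspect ratio, so that the number of
fixed-size patches fits a given sequence-length budget. The sketch below
computes such a patch grid. The abstract does not give the formula, so the
scaling rule, the function name patch_grid, and the default patch size and
patch budget are all assumptions made for illustration.

```python
import math
from typing import Tuple


def patch_grid(
    height: int,
    width: int,
    patch: int = 16,          # assumed patch side length in pixels
    max_patches: int = 2048,  # assumed sequence-length budget
) -> Tuple[int, int]:
    """Return (rows, cols) of patches for an aspect-preserving rescale.

    The scale is chosen so that rows * cols comes as close as possible to
    max_patches without exceeding it.
    """
    if min(height, width, patch, max_patches) <= 0:
        raise ValueError("all arguments must be positive")

    # Solve (scale * height / patch) * (scale * width / patch) = max_patches.
    scale = math.sqrt(max_patches * (patch / height) * (patch / width))

    # Round down to whole patches and keep at least one row and one column.
    rows = max(min(math.floor(scale * height / patch), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    rows, cols = patch_grid(height=100, width=200)
    print(rows, cols, rows * cols)
```

Under this rule a wide screenshot receives more columns than rows instead of
being squashed into a fixed square grid, which is the property the phrase
"variable-resolution" suggests.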

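Pix2Struct is a pretrained image-to-text model that can parse rich text and can be applied to tasks across multiple domains, achieving state-of-the-art results.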