In this paper, we present StrucTexTv2, an effective document image
pre-training framework based on masked visual-textual prediction. It
consists of two self-supervised pre-training tasks, masked image modeling and
masked language modeling, both built on text region-level image masking. The
proposed method randomly masks some image regions according to the bounding box
coordinates of text words. The pre-training objectives are to reconstruct the
pixels of the masked image regions and to predict the corresponding masked
tokens simultaneously. Hence the pre-trained encoder can capture more textual
semantics than conventional masked image modeling, which usually predicts only
the masked image patches. Compared to the masked multi-modal modeling methods for
document image understanding that rely on both the image and text modalities,
StrucTexTv2 takes image-only input and can therefore potentially serve more
application scenarios, since it requires no OCR pre-processing. Extensive
experiments on mainstream
benchmarks of document image understanding demonstrate the effectiveness of
StrucTexTv2. It achieves competitive or even new state-of-the-art performance
on various downstream tasks, including image classification, layout analysis,
table structure recognition, document OCR, and end-to-end information
extraction.

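This paper proposes StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. Experiments verify that the model achieves competitive or even new state-of-the-art performance across the downstream tasks of document image understanding.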
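
The text region-level masking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 30% mask ratio, the zero fill value, and the mean-squared-error pixel loss are all assumptions chosen for illustration, and only the image side of the objective is shown. The indices of the chosen boxes would identify which word tokens serve as targets for the masked language modeling task.

```python
import numpy as np


def mask_text_regions(image, word_boxes, mask_ratio=0.3, seed=0):
    """Blank out a random subset of word boxes in a document image.

    image:      (H, W, C) array.
    word_boxes: list of (x0, y0, x1, y1) pixel boxes, x1 and y1 exclusive.
    mask_ratio and the zero fill value are assumptions, not from the paper.
    Returns the masked copy, a boolean (H, W) mask, and the chosen indices.
    """
    masked = image.copy()
    pixel_mask = np.zeros(image.shape[:2], dtype=bool)
    if len(word_boxes) == 0:
        return masked, pixel_mask, []

    rng = np.random.default_rng(seed)
    num_masked = max(1, round(mask_ratio * len(word_boxes)))
    chosen = rng.choice(len(word_boxes), size=num_masked, replace=False)

    for idx in chosen:
        x0, y0, x1, y1 = word_boxes[idx]
        masked[y0:y1, x0:x1] = 0          # hide the word's pixels
        pixel_mask[y0:y1, x0:x1] = True   # remember where the targets are
    return masked, pixel_mask, sorted(int(i) for i in chosen)


def masked_pixel_loss(predicted, target, pixel_mask):
    """Mean squared error over masked pixels only (an assumed loss)."""
    diff = (predicted.astype(np.float64) - target.astype(np.float64)) ** 2
    return float(diff[pixel_mask].mean())


# Hypothetical usage: a blank white page with four word boxes.
page = np.full((20, 40, 3), 255, dtype=np.uint8)
boxes = [(0, 0, 10, 5), (12, 0, 22, 5), (0, 8, 10, 13), (12, 8, 22, 13)]
masked_page, target_mask, masked_words = mask_text_regions(page, boxes)
```

Masking whole word boxes rather than fixed grid patches ties each hidden image region to a word, so the same selection yields both a pixel reconstruction target and a token prediction target.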