In this paper, we present StrucTexTv2, an effective document image
pre-training framework based on masked visual-textual prediction. It
consists of two self-supervised pre-training tasks: masked image modeling and
masked language modeling, based on text region-level image masking. The
proposed method randomly masks some image regions according to the bounding box
coordinates of text words. The objectives of our pre-training tasks are to
reconstruct the pixels of the masked image regions and the corresponding masked
tokens simultaneously. Hence, the pre-trained encoder captures more textual
semantics than conventional masked image modeling, which usually predicts
masked image patches. Compared to masked multi-modal modeling methods for
document image understanding, which rely on both the image and text modalities,
StrucTexTv2 takes image-only input and can therefore cover more application
scenarios without OCR pre-processing. Extensive experiments on mainstream
benchmarks of document image understanding demonstrate the effectiveness of
StrucTexTv2. It achieves competitive or even new state-of-the-art performance
in various downstream tasks such as image classification, layout analysis,
table structure recognition, document OCR, and information extraction under the
end-to-end scenario.
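
The text region-level masking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_text_regions`, the `mask_ratio` default, the fill value, and the nested-list image representation are all assumptions made for illustration. The masked word indices it returns are the positions whose pixels and tokens a model would be asked to reconstruct.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def mask_text_regions(
    image: List[List[int]],
    boxes: List[Box],
    mask_ratio: float = 0.3,  # assumed value, not taken from the paper
    fill: int = 0,
    seed: Optional[int] = None,
) -> Tuple[List[List[int]], List[int]]:
    """Blank out a random subset of word boxes in a grayscale image.

    Returns a masked copy of the image and the sorted indices of the
    masked words. The input image is left unchanged.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(boxes) * mask_ratio)) if boxes else 0
    masked_ids = sorted(rng.sample(range(len(boxes)), num_to_mask))

    masked = [row[:] for row in image]  # copy each row
    height = len(masked)
    width = len(masked[0]) if height else 0

    for idx in masked_ids:
        x0, y0, x1, y1 = boxes[idx]
        # Clip the box to the image so out-of-range boxes are harmless.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                masked[y][x] = fill
    return masked, masked_ids
```

In a pre-training loop, the original pixels inside the masked boxes would serve as the masked image modeling target and the words at `masked_ids` as the masked language modeling target.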

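This paper proposes StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. Experiments show that the model achieves competitive or even new state-of-the-art performance on a range of downstream document image understanding tasks.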

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

We propose SelfDoc, a task-agnostic pre-training framework for document image
understanding. Because documents are multimodal and are intended for sequential
reading, our framework exploits the positional, textual, and visual information
of every semantically meaningful component in a document, and it models the
contextualization between blocks of content. Unlike existing document
pre-training models, our model is coarse-grained and does not treat individual
words as input, thereby avoiding overly fine-grained input with excessive
contextualization. Beyond that, we introduce cross-modal learning in the model
pre-training phase to fully leverage multimodal information from unlabeled
documents. For downstream usage, we propose a novel modality-adaptive attention
mechanism for multimodal feature fusion by adaptively emphasizing language and
vision signals. Through a feature-masking training strategy, our framework
benefits from self-supervised pre-training on documents without requiring annotations.
It achieves superior performance on multiple downstream tasks with
significantly fewer document images used in the pre-training stage compared to
previous works.
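
A feature-masking objective of the kind mentioned above can be sketched as follows. This is a generic illustration, not SelfDoc's actual training code: replacing masked block features with zeros, the L1 reconstruction loss, the `mask_ratio` default, and both function names are assumptions.

```python
import random
from typing import List, Optional, Tuple

Vector = List[float]


def mask_block_features(
    features: List[Vector],
    mask_ratio: float = 0.15,  # assumed value for illustration
    seed: Optional[int] = None,
) -> Tuple[List[Vector], List[int]]:
    """Replace the feature vectors of randomly chosen blocks with zeros.

    Returns the corrupted copy and the sorted indices of masked blocks.
    """
    rng = random.Random(seed)
    n = len(features)
    num_to_mask = max(1, round(n * mask_ratio)) if n else 0
    masked_ids = sorted(rng.sample(range(n), num_to_mask))
    masked_set = set(masked_ids)
    corrupted = [
        [0.0] * len(vec) if i in masked_set else list(vec)
        for i, vec in enumerate(features)
    ]
    return corrupted, masked_ids


def masked_l1_loss(
    predicted: List[Vector], target: List[Vector], masked_ids: List[int]
) -> float:
    """Mean absolute error computed over masked blocks only."""
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0
```

No labels are involved: the uncorrupted features themselves act as the supervision signal, which is what allows pre-training on unannotated documents.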

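SelfDoc is a task-agnostic pre-training framework for document image understanding. It exploits the positional, textual, and visual information of a document and models the contextual relationships between content blocks. It introduces cross-modal learning in pre-training, adaptively fuses vision and language signals, and is pre-trained in a self-supervised manner. Compared with previous work, it achieves better performance while using fewer document images.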

SelfDoc: Self-Supervised Document Representation Learning

Pre-training techniques have been verified successfully in a variety of NLP
tasks in recent years. Despite the widespread use of pre-training models for
NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image
understanding. In this paper, we propose LayoutLM to jointly model
interactions between text and layout information across scanned document
images, which is beneficial for a great number of real-world document image
understanding tasks such as information extraction from scanned documents.
Furthermore, we also leverage image features to incorporate words' visual
information into LayoutLM. To the best of our knowledge, this is the first time
that text and layout are jointly learned in a single framework for
document-level pre-training. It achieves new state-of-the-art results in
several downstream tasks, including form understanding (from 70.72 to 79.27),
receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly
available at https://aka.ms/layoutlm.
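
To make "layout information" concrete, the sketch below shows one common way to prepare word bounding boxes for a layout-aware model: rescaling pixel coordinates to a fixed 0 to 1000 range so that pages of different sizes share one coordinate vocabulary. The abstract does not describe this step; the function name `normalize_box`, the clamping behavior, and the `scale` parameter are assumptions for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def normalize_box(
    box: Box, page_width: float, page_height: float, scale: int = 1000
) -> Tuple[int, int, int, int]:
    """Rescale a pixel-space box to integer coordinates in [0, scale]."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")

    def clamp(value: float) -> int:
        # Truncate to an integer and keep it inside the valid range.
        return max(0, min(scale, int(value)))

    x0, y0, x1, y1 = box
    return (
        clamp(scale * x0 / page_width),
        clamp(scale * y0 / page_height),
        clamp(scale * x1 / page_width),
        clamp(scale * y1 / page_height),
    )
```

Each normalized coordinate can then index a learned embedding table, so that a token's representation combines what the word is with where it sits on the page.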

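This paper proposes LayoutLM, a model for scanned document images that jointly learns text and layout information. Applied to real-world document image understanding tasks such as information extraction, it achieves state-of-the-art results on several downstream tasks. The code and pre-trained models are publicly available.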