Document intelligence, a relatively new research topic, supports many
business applications. Its main task is to automatically read, understand, and
analyze documents. However, the diversity of document formats (invoices,
reports, forms, etc.) and layouts makes it difficult for machines to
understand documents. In this paper, we present GraphDoc, a multimodal
graph attention-based model for various document understanding tasks. GraphDoc
is pre-trained in a multimodal framework by utilizing text, layout, and image
information simultaneously. Because a text block in a document relies heavily
on its surrounding context, we inject the graph structure into the
attention mechanism to form a graph attention layer, so that each input node can
attend only to its neighbors. The input nodes of each graph attention layer
are composed of textual, visual, and positional features from semantically
meaningful regions in a document image. A gate fusion layer fuses the
multimodal features of each node, and the graph attention layer models the
contextualization between nodes. GraphDoc learns a generic representation
from only 320k unlabeled documents via the Masked Sentence Modeling task.
Extensive experimental results on publicly available datasets show that
GraphDoc achieves state-of-the-art performance, which demonstrates the
effectiveness of our proposed method. The code is available at
this https URL
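
The neighbor-restricted attention described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of k nearest box centers as the neighborhood, and the single-head, projection-free attention are all assumptions made for illustration.

```python
import math


def k_nearest_neighbors(centers, k):
    """Build a neighborhood per node from 2D box centers.

    Assumption for illustration: a node's neighborhood is itself plus the
    k spatially closest nodes. The abstract does not specify this rule.
    """
    neighbors = []
    for i, (xi, yi) in enumerate(centers):
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: (centers[j][0] - xi) ** 2 + (centers[j][1] - yi) ** 2,
        )
        neighbors.append({i, *others[:k]})
    return neighbors


def neighbor_masked_attention(queries, keys, values, neighbors):
    """Scaled dot-product attention where node i sees only neighbors[i].

    queries, keys, values: one float vector per node.
    neighbors: neighbors[i] is the set of node indices node i may attend to.
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        allowed = sorted(neighbors[i])
        # Scores exist only for allowed nodes, so every other node gets weight 0.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed nodes.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the neighbors' value vectors.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed))
            for d in range(len(values[0]))
        ])
    return outputs
```

Passing one-hot value vectors makes the attention weights directly visible in the output, which is a convenient way to check that nodes outside a neighborhood receive zero weight.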

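This paper proposes GraphDoc, an automatic document analysis model that combines text and images through a multimodal graph attention mechanism. The model is pre-trained on text, layout, and image information. A gate fusion layer fuses the multimodal features of each node, and a graph attention layer models the contextual relations between nodes. GraphDoc learns a generic representation from only 320k unlabeled documents and achieves state-of-the-art performance on public datasets.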
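
As a rough illustration of per-node gated fusion, the sketch below mixes a text vector and a visual vector with a learned sigmoid gate. This is a generic gated-fusion form, not necessarily the exact layer used in GraphDoc; `gate_fusion`, `weights`, and `bias` are hypothetical names, and both modalities are assumed to share the same dimension.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_fusion(text_vec, visual_vec, weights, bias):
    """Fuse two modality vectors of equal length with an element-wise gate.

    weights: one row per output dimension, each of length 2 * dim, applied to
             the concatenation [text_vec; visual_vec] (hypothetical parameters).
    bias:    one float per output dimension.
    """
    joint = list(text_vec) + list(visual_vec)
    fused = []
    for d in range(len(text_vec)):
        # The gate decides, per dimension, how much of each modality to keep.
        gate = sigmoid(sum(w * x for w, x in zip(weights[d], joint)) + bias[d])
        fused.append(gate * text_vec[d] + (1.0 - gate) * visual_vec[d])
    return fused
```

With all-zero weights and bias the gate is 0.5 everywhere, so the result is the plain average of the two vectors. A trained gate would instead shift toward whichever modality is more informative for a given node.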

Multimodal Pre-training Based on Graph Attention Network for Document Understanding

Multimodal pre-training with text, layout, and image has made significant
progress in Visually Rich Document Understanding (VRDU), especially for
fixed-layout documents such as scanned document images. However, there are still
a large number of digital documents whose layout is not fixed
and must be interactively and dynamically rendered for visualization,
which makes existing layout-based pre-training approaches difficult to apply. In this
paper, we propose MarkupLM for understanding documents that use a markup
language as their backbone, such as HTML/XML-based documents, where text and
markup information are jointly pre-trained. Experimental results show that the
pre-trained MarkupLM significantly outperforms the existing strong baseline
models on several document understanding tasks. The pre-trained model and code
will be publicly available at this https URL
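
To make "text and markup information" concrete, the sketch below pairs each text node of an HTML document with an XPath-like tag path, using only Python's standard `html.parser`. The abstract does not say how markup is encoded, so the path representation, the class name `TagPathExtractor`, and the helper `extract_text_nodes` are assumptions for illustration, not MarkupLM's actual preprocessing.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathExtractor(HTMLParser):
    """Collect (tag_path, text) pairs, assuming reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []            # open elements as (tag, sibling index)
        self.child_counts = [{}]   # per-depth counters of tags seen so far
        self.nodes = []            # collected (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, counts[tag]))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only close the innermost element; mismatched tags are ignored.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "".join(f"/{tag}[{index}]" for tag, index in self.stack)
            self.nodes.append((path, text))


def extract_text_nodes(html):
    """Return a list of (tag_path, text) pairs for every non-empty text node."""
    parser = TagPathExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Each pair gives a model both the token content and its position in the markup tree, which stands in for the 2D coordinates that layout-based approaches rely on.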

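This paper studies a pre-trained model named MarkupLM that can understand and analyze documents written in markup languages such as HTML/XML. Compared with existing layout-based pre-training methods, it performs better on digital documents whose layout is interactive and dynamically rendered. Experiments show that the pre-trained model outperforms existing strong baseline models on several document understanding tasks.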