Transformer-based Language Models are widely used in Natural Language Processing tasks. Thanks to their pre-training, they have been successfully adapted to Information Extraction in business documents. However, most pre-training tasks proposed in the literature for business documents are too generic and insufficient for learning more complex structures. In this paper, we use LayoutLM, a language model pre-trained on a collection of business documents, and introduce two new pre-training tasks that further improve its capacity to extract relevant information. The first aims at a better understanding of the complex layout of documents, and the second focuses on numeric values and their order of magnitude. These tasks force the model to learn better-contextualized representations of the scanned documents. We further introduce a new post-processing algorithm for decoding BIESO tags in Information Extraction that performs better on complex entities. Our method significantly improves extraction performance on both public (from 93.88 to 95.50 F1 score) and private (from 84.35 to 84.84 F1 score) datasets composed of expense receipts, invoices, and purchase orders.
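For readers unfamiliar with BIESO tagging, the kind of decoding step that the post-processing algorithm improves on can be sketched as follows. This is a minimal strict baseline decoder, not the algorithm introduced in the paper: the function name `decode_bieso`, the `PREFIX-LABEL` tag format, and the label names in the usage note are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (entity label, index of first token, index of last token), both inclusive
Span = Tuple[str, int, int]


def decode_bieso(tags: List[str]) -> List[Span]:
    """Strict baseline decoding of BIESO tags into entity spans.

    Assumes tags look like "B-TOTAL", "I-TOTAL", "E-TOTAL", "S-DATE" or "O".
    A multi-token entity is kept only if it is a well-formed B, I*, E run
    with a single label; anything malformed is silently dropped.
    """
    spans: List[Span] = []
    start: Optional[int] = None    # index of the pending B tag, if any
    current: Optional[str] = None  # label of the pending entity, if any

    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S" and label:
            # Single-token entity; it also cancels any pending entity.
            spans.append((label, i, i))
            start, current = None, None
        elif prefix == "B" and label:
            # Open a new entity, discarding any unfinished one.
            start, current = i, label
        elif prefix == "I" and label and label == current:
            # Continue the pending entity.
            continue
        elif prefix == "E" and label and label == current:
            # Close the pending entity.
            spans.append((label, start, i))
            start, current = None, None
        else:
            # "O", or a tag that breaks the expected sequence: reset.
            start, current = None, None
    return spans
```

Under these assumptions, `decode_bieso(["B-TOTAL", "I-TOTAL", "E-TOTAL", "O", "S-DATE"])` yields one three-token `TOTAL` span and one single-token `DATE` span. A strict decoder of this kind discards an entity as soon as one tag in it is mispredicted, which is the sort of weakness on long or complex entities that a more tolerant post-processing step would be expected to address.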


Improving Information Extraction on Business Documents with Specific Pre-Training Tasks