Significantly Reducing Data-Labeling Costs for Visually Rich Document Extraction Models

Oct 2022

Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, Sandeep Tata

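TL;DR: The paper proposes combining selective labeling with active learning to reduce the cost of labeling examples for which the model can already predict an extraction. Experiments show that, compared with full annotation, the method lowers labeling cost by 10x without hurting accuracy, and that it applies to documents from different domains.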
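
To make the active-learning idea in the TL;DR concrete, here is a minimal sketch of one common selection rule, uncertainty sampling. It is not the authors' strategy, which this summary does not specify. The function name `select_for_review`, the candidate fields, and all scores are hypothetical and chosen only for illustration.

```python
def select_for_review(candidates, budget):
    """Return the `budget` candidates whose scores are closest to 0.5.

    Assumption: each candidate carries a model confidence in [0, 1], and
    a score near 0.5 means the model is least sure about that extraction.
    """
    ranked = sorted(candidates, key=lambda c: abs(c["score"] - 0.5))
    return ranked[:budget]


# Hypothetical candidate extractions from one invoice (illustrative values).
candidates = [
    {"field": "invoice_date", "text": "2022-10-01", "score": 0.97},
    {"field": "total_amount", "text": "$1,250.00", "score": 0.52},
    {"field": "due_date", "text": "2022-11-01", "score": 0.10},
    {"field": "invoice_id", "text": "INV-0042", "score": 0.41},
]

# Only the most uncertain candidates are sent to a human annotator.
to_review = select_for_review(candidates, budget=2)
for c in to_review:
    print(c["field"], c["text"], c["score"])
```

Under these assumptions, an annotator would review only the selected candidates instead of labeling every field in every document, which is where a cost saving of this kind would come from.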

Abstract

A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model wit