BriefGPT.xyz
Feb, 2022
OCR-IDL: OCR Annotations for Industry Document Library Dataset
Ali Furkan Biten, Rubèn Tito, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas
TL;DR
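This work presents OCR-IDL, a dataset of OCR annotations produced with a commercial OCR engine, to address the data inconsistency that arises when existing pretraining methods rely on different OCR engines. The annotations are valued at roughly US$20,000 and can be used for future document intelligence research.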
Abstract
Pretraining has proven successful in document intelligence tasks, where a deluge of documents is used to pretrain models that are only later finetuned on downstream tasks. One of the problems of the