Sanskrit is a classical language with about 30 million extant manuscripts fit
for digitisation, available in written, printed or scanned image forms. However,
it is still considered to be a low-resource language when it comes to available
digital resources. In this work, we release a post-OCR text correction dataset
containing around 218,000 sentences, with 1.5 million words, from 30 different
books. Texts in Sanskrit are known to be diverse in terms of their linguistic
and stylistic usage since Sanskrit was the 'lingua franca' for discourse in the
Indian subcontinent for about 3 millennia. Keeping this in mind, we release a
multi-domain dataset, from areas as diverse as astronomy, medicine and
mathematics, with some of them as old as 18 centuries. Further, we release
multiple strong baselines as benchmarks for the task, based on pre-trained
Seq2Seq language models. We find that our best-performing model, consisting of
byte level tokenization in conjunction with phonetic encoding (Byt5+SLP1),
yields a 23 percentage point improvement over the OCR output in terms of word and character
error rates. Moreover, we perform extensive experiments in evaluating these
models on their performance and analyse common causes of mispredictions both at
the graphemic and lexical levels. Our code and dataset are publicly available at
this https URL.
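
The word and character error rates mentioned above are conventionally computed from the edit distance between the ground truth and a system output. The following is a minimal sketch under that assumption, not the authors' evaluation script; the function names and the romanised toy strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "dharmakSetre kurukSetre"       # hypothetical ground-truth line
    ocr_output = "dharmakSetre kurukSatre"  # hypothetical OCR output with one wrong character
    print(wer(truth, ocr_output))  # one of two words is wrong
    print(cer(truth, ocr_output))  # one of 23 characters is wrong
```

A gain reported in percentage points is then the difference between the OCR output's error rate and the corrected output's error rate, each expressed as a percentage.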

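In this work, we release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books. It covers domains such as astronomy, medicine and mathematics, with some texts dating back 18 centuries. We also release multiple strong baselines, based on pre-trained Seq2Seq language models, as benchmarks for the task. Our best model, which combines byte-level tokenization with phonetic encoding (Byt5+SLP1), achieves a 23 percentage point improvement in word and character error rates.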
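
To illustrate how a phonetic encoding interacts with byte-level tokenization, the sketch below compares the UTF-8 byte length of a Devanagari word with that of its SLP1 form. It assumes the model consumes raw UTF-8 bytes, as byte-level tokenizers do; the three-letter mapping table is a deliberately tiny, hypothetical subset covering only the example word, not a full transliterator and not the authors' pipeline.

```python
# Hypothetical mini-table: just enough Devanagari to transliterate one word.
CONSONANTS = {
    "\u0927": "D",  # DHA, SLP1 'D'
    "\u0930": "r",  # RA
    "\u092e": "m",  # MA
}
VIRAMA = "\u094d"  # suppresses the inherent vowel 'a' of the preceding consonant


def to_slp1(text):
    """Transliterate text made only of the consonants above and the virama."""
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            out.append("a")  # every bare consonant carries an inherent 'a'
        elif ch == VIRAMA:
            out.pop()        # drop the inherent 'a' that was just added
        else:
            raise ValueError(f"character {ch!r} is outside this toy table")
    return "".join(out)


if __name__ == "__main__":
    word = "\u0927\u0930\u094d\u092e"  # the word 'dharma' in Devanagari
    slp1 = to_slp1(word)
    print(slp1)                        # Darma
    print(len(word.encode("utf-8")))   # 12 bytes: 4 code points, 3 bytes each
    print(len(slp1.encode("utf-8")))   # 5 bytes: one ASCII byte per sound
```

Under these assumptions, the SLP1 form gives a byte-level model a shorter sequence in which each byte corresponds to one sound.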

Benchmark and Dataset for Post-OCR Text Correction of Sanskrit

A Benchmark and Dataset for Post-OCR text correction in Sanskrit

We propose a post-OCR text correction approach for digitising texts in
Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models
trained for other languages written in Roman script. Currently, there exists no
dataset available for Romanised Sanskrit OCR. So, we bootstrap a dataset of 430
images, scanned in two different settings and their corresponding ground truth.
For training, we synthetically generate training images for both the settings.
We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage
increase of 7.69 in Character Recognition Rate (CRR) over the current
state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al.,
2016). We find that our system is robust in combating OCR-prone errors, as it
obtains a CRR of 87.01% from an OCR output with CRR of 35.76% for one of the
dataset settings. A human judgment survey performed on the models shows that
our proposed model results in predictions which are faster to comprehend and
faster to improve for a human than the other systems.
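
A change in CRR can be stated either in percentage points or as a relative percentage increase, and the two differ considerably. The sketch below applies both conventions to the 35.76% and 87.01% figures quoted above; the helper names are hypothetical, and nothing here asserts which convention underlies the 7.69 figure.

```python
def point_gain(before, after):
    """Absolute gain in percentage points between two rates given in percent."""
    return after - before


def relative_gain(before, after):
    """Relative gain, as a percentage of the starting rate."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    ocr_crr = 35.76        # CRR of the raw OCR output, in percent (from the text)
    corrected_crr = 87.01  # CRR after post-OCR correction, in percent (from the text)
    print(round(point_gain(ocr_crr, corrected_crr), 2))     # 51.25 percentage points
    print(round(relative_gain(ocr_crr, corrected_crr), 1))  # 143.3 percent relative increase
```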

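We propose a post-OCR text correction method for digitising Romanised Sanskrit texts. It uses OCR models trained for other languages written in Roman script, trains on synthetically generated data, and uses a copying mechanism to raise the character recognition rate. Experiments show that the model improves on the current state-of-the-art model for monotone sequence-to-sequence tasks by 7.69% and effectively reduces OCR errors. In addition, its predictions are faster for humans to comprehend and improve.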

Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

This paper explores the use of a learned classifier for post-OCR text
correction. Experiments with the Arabic language show that this approach, which
integrates a weighted confusion matrix and a shallow language model, corrects
the vast majority of segmentation and recognition errors, the most frequent
types of error on our dataset.
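
One simple way to combine a weighted confusion matrix with a shallow language model is to score each candidate correction by the product of a channel probability and a word probability. The sketch below uses that scoring rule with hand-set weights instead of a learned classifier, and Latin-script toy data instead of Arabic; every table, constant and function name in it is a hypothetical illustration, not the paper's model.

```python
import itertools
import math

# Hypothetical confusion weights: (intended, observed) -> P(observed | intended).
CONFUSION = {("l", "1"): 0.2, ("o", "0"): 0.3}
KEEP_PROB = 0.9  # assumed probability that a character was recognised correctly

# Hypothetical word counts standing in for a shallow (unigram) language model.
COUNTS = {"hello": 50, "world": 30}


def lm_prob(word):
    """Add-one smoothed unigram probability, with one slot for unseen words."""
    total = sum(COUNTS.values()) + len(COUNTS) + 1
    return (COUNTS.get(word, 0) + 1) / total


def candidates(observed):
    """Yield (candidate, channel_probability) pairs for an observed OCR token."""
    options = []
    for ch in observed:
        # Either the character is right, or it is a known misreading of another one.
        choices = [(ch, KEEP_PROB)]
        choices += [(src, p) for (src, obs), p in CONFUSION.items() if obs == ch]
        options.append(choices)
    for combo in itertools.product(*options):
        word = "".join(c for c, _ in combo)
        channel = math.prod(p for _, p in combo)
        yield word, channel


def correct(observed):
    """Return the candidate maximising log channel score plus log LM score."""
    return max(candidates(observed),
               key=lambda wc: math.log(wc[1]) + math.log(lm_prob(wc[0])))[0]


if __name__ == "__main__":
    print(correct("he11o"))  # hello
    print(correct("w0rld"))  # world
```

The candidate set grows exponentially with the number of ambiguous characters, so a practical system would prune it; the exhaustive product is kept here only for brevity.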

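This paper explores the use of a learned classifier for post-OCR text correction. Experiments with Arabic show that this approach, which combines a weighted confusion matrix with a shallow language model, corrects the vast majority of segmentation and recognition errors, the most frequent error types in our dataset.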