OCR errors are common in digitised historical archives, significantly affecting their usability and value. Generative Language Models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and the broader socio-cultural context, a process called Context Leveraging OCR Correction (CLOCR-C). However, obtaining sufficient training data for fine-tuning such models can prove challenging. This paper shows that fine-tuning a language model on synthetic data, generated using an LM and corrupted using a character-level Markov process, can significantly improve the ability to correct OCR errors. Models trained on synthetic data reduce the character error rate by 55% and the word error rate by 32% relative to the base LM, and outperform models trained on real data. Key findings include: training on under-corrupted data is better than training on over-corrupted data; non-uniform character-level corruption is better than uniform corruption; and more tokens per observation outperforms more observations for a fixed token budget. The outputs of this paper are a set of 8 heuristics for training effective CLOCR-C models, a dataset of 11,000 synthetic 19th-century newspaper articles, and scrambledtext, a Python library for creating synthetic corrupted data.
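
To make the idea of a character-level Markov corruption process concrete, the following is a minimal sketch, not the paper's implementation and not the scrambledtext API. The state names, the transition probabilities, the confusion table, and the function `corrupt` are all illustrative assumptions. Each character is handled according to a state (correct, substitute, delete, insert) that depends only on the state of the previous character, and substitutions are non-uniform because they draw on a table of visually similar glyphs.

```python
import random

STATES = ("correct", "substitute", "delete", "insert")

# Assumed transition probabilities (illustrative values, not from the paper).
# Key = previous state; tuple order follows STATES. Errors are made "sticky"
# so that corruption tends to arrive in short bursts.
TRANSITIONS = {
    "correct":    (0.94, 0.03, 0.02, 0.01),
    "substitute": (0.70, 0.20, 0.05, 0.05),
    "delete":     (0.70, 0.10, 0.15, 0.05),
    "insert":     (0.70, 0.10, 0.05, 0.15),
}

# Assumed non-uniform substitution table: visually similar glyphs.
CONFUSIONS = {"e": "c", "c": "e", "l": "1", "i": "l", "o": "0", "m": "rn", "h": "b"}


def corrupt(text, seed=None, transitions=TRANSITIONS):
    """Corrupt `text` with a first-order Markov chain over per-character states."""
    rng = random.Random(seed)
    state = "correct"
    out = []
    for ch in text:
        # The next state depends only on the current state (Markov property).
        state = rng.choices(STATES, weights=transitions[state])[0]
        if state == "correct":
            out.append(ch)
        elif state == "substitute":
            # Characters without a table entry are left unchanged.
            out.append(CONFUSIONS.get(ch, ch))
        elif state == "delete":
            continue
        else:  # "insert": keep the character and add a spurious one after it
            out.append(ch)
            out.append(rng.choice(".,'` "))
    return "".join(out)


if __name__ == "__main__":
    print(corrupt("The committee met on the second of March.", seed=1))
```

Under these assumptions, lowering the off-diagonal probabilities in the `"correct"` row gives lighter corruption, which is the direction the abstract reports as preferable to over-corruption.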

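This study addresses the correction of OCR errors in digitised historical archives in order to improve their usability and value. The paper presents an approach called Context Leveraging OCR Correction (CLOCR-C) and shows that fine-tuning a language model on synthetic data significantly improves its ability to correct OCR errors; in testing, the character error rate fell by 55% and the word error rate by 32%.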

Scrambled text: training Language Models to correct OCR errors using synthetic data