Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.

为了克服对大多数语言缺乏大量平行标注数据的问题，本研究介绍了一种灵活可扩展的合成数据生成流程，应用于土耳其语。通过复杂的转换函数，实现了20多个专业编辑语法和拼写规则的生成，从而得到了13万句高质量平行句子。使用神经机器翻译、序列标注和前缀调参等三种基线模型，取得了良好的结果，并对领域外数据集进行了详尽实验，获得有关所提方法的可迁移性和鲁棒性的深入见解。通过发布数据集、基线模型和合成数据生成流程，我们鼓励进一步研究土耳其语错误检测和纠正。

GECTurk：用于土耳其语的语法错误校正和检测数据集