We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research has focused primarily on correcting texts from a single domain, especially learner essays. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains: social media, scientific writing, and examinations. We provide solid benchmark results for NaSGEC using cutting-edge CGEC models and different training data. We further analyze in detail the connections and gaps between our domains from both empirical and statistical perspectives. We hope this work can inspire future studies on an important but under-explored direction: cross-domain GEC.

NaSGEC: A Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts