Some grammatical error correction (GEC) systems incorporate hand-crafted
rules and achieve positive results. However, manually defining rules is
time-consuming and laborious. In view of this, we propose a method to mine
error templates for GEC automatically. An error template is a regular
expression that identifies text errors. We use a web crawler to acquire
such error templates from the Internet. For each template, we further select
the corresponding corrective action by using the language model perplexity as a
criterion. We have accumulated 1,119 error templates for Chinese GEC based on
this method. Experimental results on the newly proposed CTC-2021 Chinese GEC
benchmark show that incorporating our error templates effectively improves the
performance of a strong GEC system, especially on two error types with very
little training data. Our error templates are available at
https://github.com/HillZhang1999/gec_error_template.
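
The selection step described above (one template, several candidate corrective actions, the lowest-perplexity result wins) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the English template `TEMPLATE`, the two actions in `ACTIONS`, the tiny `CORPUS`, and the add-one-smoothed bigram model stand in for the mined Chinese templates and the language model actually used, neither of which is specified here.

```python
import math
import re
from collections import Counter

# Hypothetical error template (not from the released template set):
# "about ... or so" expresses approximation twice.
TEMPLATE = re.compile(r"\babout (\w+(?: \w+)?) or so\b")

# Hypothetical candidate corrective actions, written as regex replacements.
ACTIONS = {
    "drop_about": r"\1 or so",
    "drop_or_so": r"about \1",
}

# Toy corpus standing in for the data a real language model is trained on.
CORPUS = [
    "it takes about ten minutes",
    "we waited about two hours",
    "she paid about five dollars",
    "it takes ten minutes",
]


def train_bigram(sentences):
    """Collect context and bigram counts plus the vocabulary size."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    """Perplexity under the add-one-smoothed bigram model (a toy estimate)."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / (len(tokens) - 1))


def correct(sentence, model):
    """Apply every action to a matching sentence; keep the most fluent result."""
    if not TEMPLATE.search(sentence):
        return sentence
    candidates = [TEMPLATE.sub(repl, sentence) for repl in ACTIONS.values()]
    return min(candidates, key=lambda cand: perplexity(cand, model))


MODEL = train_bigram(CORPUS)
print(correct("it takes about ten minutes or so", MODEL))
```

With this toy corpus, the variant that keeps "about" scores the lower perplexity, so the "or so" deletion is selected; a different corpus or language model could prefer the other action.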

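This work proposes a grammatical error correction technique that automatically mines error templates with a web crawler. An error template is a regular expression used to identify text errors, and language model perplexity serves as the criterion for selecting the corresponding corrective action. Experimental results show that incorporating the error templates effectively improves the performance of a strong GEC system, especially on two error types with very little training data.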

Mining Error Templates for Grammatical Error Correction

The pervasiveness of intra-utterance code-switching (CS) in spoken content
requires that automatic speech recognition (ASR) systems handle mixed-language input. Designing
a CS-ASR system has many challenges, mainly due to data scarcity, grammatical
structure complexity, and domain mismatch. The most common method for
addressing CS is to train an ASR system with the available transcribed CS
speech, along with monolingual data. In this work, we propose a zero-shot
learning methodology for CS-ASR by augmenting the monolingual data with
artificially generated CS text. We base our approach on random lexical
replacements and the Equivalence Constraint (EC), exploiting aligned
translation pairs to generate random and grammatically valid CS content. Our
empirical results show a 65.5% relative reduction in language model perplexity,
and 7.7% in ASR word error rate (WER) on two ecologically valid CS test sets. The human
evaluation of the generated text using EC suggests that more than 80% is of
adequate quality.
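
The random lexical replacement idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function `lexical_replace`, the English/Spanish sentence pair, and the hand-written word alignment are all hypothetical, and the Equivalence Constraint check on grammatical validity is not implemented.

```python
import random


def lexical_replace(src_tokens, tgt_tokens, alignment, ratio, rng):
    """Swap a random subset of aligned source words for their translations.

    alignment is a list of (source_index, target_index) pairs, assumed to be
    one-to-one; ratio is the probability of replacing each aligned word.
    """
    mixed = list(src_tokens)
    for src_idx, tgt_idx in alignment:
        if rng.random() < ratio:
            mixed[src_idx] = tgt_tokens[tgt_idx]
    return mixed


# Hypothetical aligned translation pair (illustrative only).
SRC = "I want to buy a book".split()
TGT = "quiero comprar un libro".split()
ALIGNMENT = [(1, 0), (3, 1), (4, 2), (5, 3)]

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for _ in range(3):
    print(" ".join(lexical_replace(SRC, TGT, ALIGNMENT, 0.5, rng)))
```

Each call yields one synthetic code-switched sentence that could be added to the monolingual language model training text. Purely random replacement gives no guarantee of grammatical validity, which is the gap the EC-based generation is meant to close.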

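Building on random lexical replacement and the Equivalence Constraint, this work uses aligned translation pairs to generate random, grammatically valid code-switched content for zero-shot learning, addressing the data scarcity, grammatical structure complexity, and domain mismatch that challenge code-switching speech recognition. Experiments show that the proposed method achieves relative reductions of 65.5% in language model perplexity and 7.7% in ASR WER on two ecologically valid code-switching test sets, and human evaluation of the text generated with the Equivalence Constraint indicates that more than 80% is of adequate quality.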