Counterfactual text generation aims to minimally change a text so that it is classified differently. Progress in this area is hard to judge because related work uses inconsistent datasets and metrics. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text quality metrics and includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST), and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text: methods that excel at counterfactual metrics often produce lower-quality text, while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute more methods and maintain consistent evaluation in future work.
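
The two criteria in the definition above (the label changes, and the edit is minimal) can be illustrated with a small sketch. This is not CEval's API: the function names `flip_rate`, `token_edit_distance`, and `mean_normalized_distance`, the whitespace tokenization, and the keyword-based `toy_classifier` are all assumptions made for illustration, and the benchmark's own metric definitions may differ.

```python
from typing import Callable, Sequence


def token_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def flip_rate(
    originals: Sequence[str],
    counterfactuals: Sequence[str],
    classify: Callable[[str], str],
) -> float:
    """Fraction of pairs whose predicted label differs after the edit."""
    flipped = sum(
        classify(o) != classify(c) for o, c in zip(originals, counterfactuals)
    )
    return flipped / len(originals)


def mean_normalized_distance(
    originals: Sequence[str], counterfactuals: Sequence[str]
) -> float:
    """Average token edit distance, normalized by the original's token count."""
    total = 0.0
    for o, c in zip(originals, counterfactuals):
        o_tok, c_tok = o.split(), c.split()  # assumption: whitespace tokens
        total += token_edit_distance(o_tok, c_tok) / max(len(o_tok), 1)
    return total / len(originals)


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a trained sentiment classifier."""
    return "negative" if "bad" in text.split() else "positive"


originals = ["the movie was good", "the food was bad"]
counterfactuals = ["the movie was bad", "the food was bad indeed"]

print(flip_rate(originals, counterfactuals, toy_classifier))
print(mean_normalized_distance(originals, counterfactuals))
```

In this toy data only the first pair changes the predicted label, and each pair differs by a single token, so a method could score well on minimality while still failing to flip the label. Text quality metrics, which the abstract treats as a separate axis, are not covered by this sketch.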

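CEval is a benchmark library for evaluating counterfactual text generation. It combines counterfactual and text quality metrics and includes common counterfactual datasets with annotations, standard baseline models, and the open-source language model LLAMA-2. The experiments show that no current method generates counterfactual text perfectly. Methods that perform well on counterfactual metrics tend to produce lower-quality text, whereas language models with simple prompts generate high-quality text but struggle with the counterfactual criteria. Releasing CEval as an open-source Python library is intended to encourage the community to contribute more methods and to keep evaluation consistent in future research.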

CEval: A Benchmark for Evaluating Counterfactual Text Generation