Counterfactual explanations can be used to interpret and debug text classifiers by producing minimally altered text inputs that change a classifier's output. In this work, we evaluate five methods for generating counterfactual explanations for a BERT text classifier on two datasets using three evaluation metrics. The results of our experiments suggest that established white-box substitution-based methods are effective at generating valid counterfactuals that change the classifier's output. In contrast, newer methods based on large language models (LLMs) excel at producing natural and linguistically plausible text counterfactuals but often fail to generate valid counterfactuals that alter the classifier's output. Based on these results, we recommend developing new counterfactual explanation methods that combine the strengths of established gradient-based approaches and newer LLM-based techniques to generate high-quality, valid, and plausible text counterfactual explanations.

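This paper studies the application of counterfactual explanation methods to text classifiers and compares five such methods. Traditional substitution-based methods perform well at generating valid counterfactuals, whereas newer methods based on large language models excel at generating natural text but often fail to change the classifier's output. The study recommends combining the strengths of both classes of methods to develop new, high-quality counterfactual explanation methods.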

A Comparative Analysis of Counterfactual Explanation Methods for Text Classifiers