The increasing difficulty of distinguishing language-model-generated text from human-written text has led to the development of detectors of machine-generated text (MGT). In many contexts, however, a black-box prediction is not sufficient: it is equally important to know on what grounds a detector made that prediction. Explanation methods that estimate feature importance promise to indicate which parts of an input a classifier uses for its prediction, but the quality of different explanation methods has not previously been assessed for detectors of MGT. This study conducts the first systematic evaluation of explanation quality for this task. Faithfulness and stability are assessed with five automated experiments, and usefulness is evaluated in a user study. We use a dataset of ChatGPT-generated and human-written documents, and pair the predictions of three existing language-model-based detectors with the corresponding SHAP, LIME, and Anchor explanations. We find that SHAP performs best in terms of faithfulness, stability, and helping users predict the detectors' behavior. In contrast, LIME, although perceived as the most useful by users, scores worst in terms of user performance at predicting the detectors' behavior.

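This study addresses the evaluation of explanation quality for detectors of machine-generated text (MGT), presenting the first systematic evaluation of different explanation methods (SHAP, LIME, and Anchor) for this task. It finds that SHAP performs best in terms of faithfulness and stability and effectively helps users predict the detectors' behavior, whereas LIME, although perceived by users as the most useful, yields the worst user performance at predicting that behavior.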

An Evaluation of Explanation Methods for Black-Box Detectors of Machine-Generated Text