Neural network models trained on text data have been found to encode
undesirable linguistic or sensitive concepts in their representation. Removing
such concepts is non-trivial because of a complex relationship between the
concept, text input, and the learnt representation. Recent work has proposed
post-hoc and adversarial methods to remove such unwanted concepts from a
model's representation. Through an extensive theoretical and empirical
analysis, we show that these methods can be counter-productive: they are unable
to remove the concepts entirely, and in the worst case may end up destroying
all task-relevant features. The reason is the methods' reliance on a probing
classifier as a proxy for the concept. Even under the most favorable conditions
for learning a probing classifier, when a concept's relevant features in
representation space alone can provide 100% accuracy, we prove that a probing
classifier is likely to use non-concept features and thus post-hoc or
adversarial methods will fail to remove the concept correctly. These
theoretical implications are confirmed by experiments on models trained on
synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of
concept removal such as fairness, we recommend caution against using these
methods and propose a spuriousness metric to gauge the quality of the final
classifier.
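
The central claim, that a probe can lean on non-concept features even when the concept's own features are perfectly predictive, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the authors' experimental setup: the two-dimensional synthetic "representation", the feature scales, the hypothetical helper names make_data and train_probe, and the plain gradient-descent logistic regression used as the probe.

```python
import numpy as np

def make_data(n=200, seed=0):
    """Toy 2-D representation (illustrative assumption, not the paper's data)."""
    rng = np.random.default_rng(seed)
    c = rng.choice([-1.0, 1.0], size=n)                 # concept label
    # Feature 0: the concept's own feature; its sign alone recovers c.
    z_concept = c * 1.0 + rng.uniform(-0.2, 0.2, n)
    # Feature 1: a non-concept feature that happens to correlate with c
    # in this sample and has a larger scale.
    z_other = c * 3.0 + rng.uniform(-0.2, 0.2, n)
    return np.column_stack([z_concept, z_other]), c

def train_probe(X, y, lr=0.1, steps=2000):
    """Linear probe trained with logistic loss by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of mean(log(1 + exp(-margin))) with respect to w.
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

X, c = make_data()
w = train_probe(X, c)

acc_concept_only = np.mean(np.sign(X[:, 0]) == c)   # concept feature alone
acc_probe = np.mean(np.sign(X @ w) == c)            # learned probe

print("accuracy using the concept feature alone:", acc_concept_only)
print("probe accuracy:", acc_probe)
print("probe weights [concept, non-concept]:", w)
```

In this toy setting the concept feature alone classifies every example correctly, yet the learned probe places more weight on the non-concept feature. A removal method that treats the probe direction as "the concept" would therefore also act on a feature that is not the concept.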

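Researchers have found that neural network models trained on text data encode undesirable linguistic or sensitive concepts. Through extensive theoretical and empirical analysis, this paper shows that post-hoc and adversarial methods cannot remove such concepts entirely and may destroy all task-relevant features. It proposes a spuriousness metric to gauge the quality of the final classifier.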

Probing Classifiers are Unreliable for Concept Removal and Detection

We propose a novel framework, ConceptX, to analyze how latent concepts are
encoded in representations learned within pre-trained language models. It uses
clustering to discover the encoded concepts and explains them by aligning with
a large set of human-defined concepts. Our analysis of seven transformer
language models reveals interesting insights: i) the latent space within the
learned representations overlaps with different linguistic concepts to a varying
degree, ii) the lower layers in the model are dominated by lexical concepts
(e.g., affixation), whereas the core-linguistic concepts (e.g., morphological
or syntactic relations) are better represented in the middle and higher layers,
iii) some encoded concepts are multi-faceted and cannot be adequately explained
using the existing human-defined concepts.
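
The two steps described above, clustering representations and then aligning each cluster with human-defined concepts, can be sketched as follows. This is a minimal illustration under stated assumptions, not the ConceptX implementation: the abstract does not name a clustering algorithm, so a small k-means with farthest-first initialization stands in; the 0.9 alignment threshold, the toy 2-D "representations", the tag names, and the helper names are all hypothetical.

```python
import numpy as np
from collections import Counter

def farthest_first_init(X, k):
    """Deterministic init: start at X[0], then repeatedly take the farthest point."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def kmeans(X, k, steps=20):
    """Plain Lloyd's k-means (a stand-in for whatever clustering is actually used)."""
    centers = farthest_first_init(X, k)
    for _ in range(steps):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def align(assign, tags, threshold=0.9):
    """Label a cluster with a human-defined tag if at least `threshold` of its
    members carry that tag; otherwise leave it unaligned (None)."""
    result = {}
    for j in sorted(set(assign.tolist())):
        members = [t for a, t in zip(assign, tags) if a == j]
        tag, count = Counter(members).most_common(1)[0]
        result[j] = tag if count / len(members) >= threshold else None
    return result

# Toy token representations: two well-separated groups of 20 tokens each.
rng = np.random.default_rng(0)
group_a = rng.uniform(-0.3, 0.3, size=(20, 2))           # near (0, 0)
group_b = 10.0 + rng.uniform(-0.3, 0.3, size=(20, 2))    # near (10, 10)
X = np.vstack([group_a, group_b])

# Hypothetical human-defined tags: group A shares one lexical tag,
# group B mixes two tags and so matches no single concept.
tags = ["SUFFIX:ing"] * 20 + ["NOUN"] * 10 + ["VERB"] * 10

assign = kmeans(X, k=2)
alignment = align(assign, tags)
print(alignment)
```

In this sketch, the cluster left as None plays the role of an encoded concept that the available human-defined concepts do not adequately explain, as in finding iii).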

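This paper proposes a novel framework, ConceptX, which uses clustering to discover the latent concepts encoded in pre-trained language models and explains them by aligning them with a large set of human-defined concepts. Its analysis of seven transformer language models reveals that i) the latent space within the learned representations overlaps with different linguistic concepts to a varying degree, ii) the lower layers of the model are dominated by lexical concepts (e.g., affixation), whereas core-linguistic concepts (e.g., morphological or syntactic relations) are better represented in the middle and higher layers, and iii) some encoded concepts are multi-faceted and cannot be adequately explained using the existing human-defined concepts.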