There are two broad approaches to building intelligent machine learning
systems. One is to build inherently interpretable models, as endeavored by the
growing field of causal representation learning. The other is to build highly
performant foundation models and then invest effort in understanding how they
work. In this work, we relate these two
approaches and study how to learn human-interpretable concepts from data.
Weaving together ideas from both fields, we formally define a notion of
concepts and show that they can be provably recovered from diverse data.
Experiments on synthetic data and large language models show the utility of our
unified approach.

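By combining ideas from causal representation learning with work on understanding how interpretable concepts can be learned from data, this study formally defines a notion of concepts and proves that they can be reliably recovered from diverse data. Experiments on synthetic data and large language models demonstrate the utility of the unified approach.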

Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models

The representation space of neural models for textual data emerges in an
unsupervised manner during training. Understanding how those representations
encode human-interpretable concepts is a fundamental problem. One prominent
approach for the identification of concepts in neural representations is
searching for a linear subspace whose erasure prevents the prediction of the
concept from the representations. However, while many linear erasure algorithms
are tractable and interpretable, neural networks do not necessarily represent
concepts in a linear manner. To identify non-linearly encoded concepts, we
propose a kernelization of a linear minimax game for concept erasure. We
demonstrate that it is possible to prevent specific non-linear adversaries from
predicting the concept. However, the protection does not transfer to different
non-linear adversaries. Therefore, exhaustively erasing a non-linearly encoded
concept remains an open problem.

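This paper proposes a kernelization of a linear minimax game in order to erase non-linearly encoded concepts from neural models. The protection does not transfer to different non-linear adversaries, however, so exhaustively erasing a non-linearly encoded concept remains an open problem.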

Kernelized Concept Erasure
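
The linear erasure that the abstract above takes as its starting point can be illustrated with a minimal sketch. This is not the minimax game or the kernelized method from the paper. It is a deliberately simplified stand-in that assumes a binary concept and a one-dimensional subspace, taken here to be the difference of the two class means. The names `concept_direction` and `erase_direction` and the toy data are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(reps, labels):
    """Unit vector along the difference of class means (assumed concept direction)."""
    pos = [r for r, y in zip(reps, labels) if y == 1]
    neg = [r for r, y in zip(reps, labels) if y == 0]
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to erase")
    return [d / norm for d in diff]

def erase_direction(rep, unit):
    """Project rep onto the orthogonal complement of the unit vector."""
    coeff = sum(r * u for r, u in zip(rep, unit))
    return [r - coeff * u for r, u in zip(rep, unit)]

# Toy 2-D representations: the first coordinate carries the concept.
reps = [[2.0, 1.0], [3.0, -1.0], [-2.0, 0.5], [-3.0, -0.5]]
labels = [1, 1, 0, 0]

unit = concept_direction(reps, labels)
erased = [erase_direction(r, unit) for r in reps]
```

After erasure, every representation has zero component along `unit`, so a linear probe that relies on that direction no longer separates the two classes. As the abstract notes, a guarantee of this kind says nothing about non-linear adversaries, which may still recover the concept from the remaining coordinates.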

Developing algorithms that are able to generalize to a novel task given only
a few labeled examples represents a fundamental challenge in closing the gap
between machine- and human-level performance. The core of human cognition lies
in the structured, reusable concepts that help us to rapidly adapt to new tasks
and provide reasoning behind our decisions. However, existing meta-learning
methods learn complex representations across prior labeled tasks without
imposing any structure on the learned representations. Here we propose COMET, a
meta-learning method that improves generalization ability by learning to learn
along human-interpretable concept dimensions. Instead of learning a joint
unstructured metric space, COMET learns mappings of high-level concepts into
semi-structured metric spaces, and effectively combines the outputs of
independent concept learners. We evaluate our model on few-shot tasks from
diverse domains, including fine-grained image classification, document
categorization and cell type annotation on a novel dataset from a biological
domain developed in our work. COMET significantly outperforms strong
meta-learning baselines, achieving a 6-15% relative improvement on the most
challenging 1-shot learning tasks, while, unlike existing methods, also
providing interpretations behind the model's predictions.

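COMET is a meta-learning method that improves generalization by learning along human-interpretable concept dimensions rather than learning a joint unstructured metric space. On few-shot tasks from diverse domains, COMET outperforms strong meta-learning baselines while providing interpretations behind the model's predictions.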
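
The idea of combining independent concept learners can be sketched roughly as follows. This is an illustrative toy, not the COMET implementation. It assumes each concept is a fixed subset of input dimensions, uses the raw features in place of learned embeddings, and combines concepts by summing squared distances to per-concept class prototypes. The concept names, the data, and the functions `prototype`, `concept_distances`, and `predict` are all hypothetical.

```python
def prototype(rows, dims):
    """Mean of the support examples, restricted to one concept's dimensions."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in dims]

def concept_distances(query, support, concepts):
    """Return {class_label: {concept_name: squared distance to prototype}}."""
    out = {}
    for label, rows in support.items():
        out[label] = {}
        for name, dims in concepts.items():
            proto = prototype(rows, dims)
            out[label][name] = sum(
                (query[d] - p) ** 2 for d, p in zip(dims, proto)
            )
    return out

def predict(query, support, concepts):
    """Pick the class with the smallest summed per-concept distance."""
    dists = concept_distances(query, support, concepts)
    totals = {label: sum(per.values()) for label, per in dists.items()}
    best = min(totals, key=totals.get)
    # The per-concept breakdown doubles as a simple explanation.
    return best, dists[best]

# Hypothetical concepts as groups of feature indices.
concepts = {"beak": [0, 1], "wing": [2, 3]}

# 1-shot support set: one labeled example per class.
support = {
    "sparrow": [[1.0, 1.0, 0.0, 0.0]],
    "eagle": [[5.0, 5.0, 4.0, 4.0]],
}

query = [1.0, 1.2, 0.5, 0.0]
label, per_concept = predict(query, support, concepts)
```

Because the decision is a sum of per-concept terms, `per_concept` shows how much each concept contributed to the match, which is the kind of interpretation the abstract refers to. In a learned system, the identity features used here would presumably be replaced by trained concept-specific embedding networks.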