Explainability models are now prevalent within machine learning to address
the black-box nature of neural networks. The question now is which
explainability model is most effective. Probabilistic Lipschitzness has
demonstrated that the smoothness of a neural network is fundamentally linked to
the quality of post hoc explanations. In this work, we prove theoretical lower
bounds on the probabilistic Lipschitzness of Integrated Gradients, LIME and
SmoothGrad. We propose a novel metric using probabilistic Lipschitzness,
normalised astuteness, to compare the robustness of explainability models.
Further, we prove a link between the local Lipschitz constant of a neural
network and its stable rank. We then demonstrate that the stable rank of a
neural network provides a heuristic for the robustness of explainability
models.
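As a rough illustration of the quantity involved, the stable rank of a single weight matrix W is commonly defined as the squared Frobenius norm divided by the squared spectral norm. The sketch below computes it in pure Python with power iteration. The function names are hypothetical, and how per-layer values would be aggregated into a network-level heuristic is not stated in the abstract, so this covers one matrix only.

```python
import math
import random


def frobenius_norm_sq(W):
    """Sum of squared entries of W (a list of rows)."""
    return sum(x * x for row in W for x in row)


def spectral_norm_sq(W, iters=200, seed=0):
    """Approximate the largest eigenvalue of W^T W by power iteration.

    This equals the squared spectral norm (largest singular value squared).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(W), len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        # u = W v
        u = [sum(w * x for w, x in zip(row, v)) for row in W]
        # z = W^T u = (W^T W) v
        z = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        lam = math.sqrt(sum(x * x for x in z))  # ||W^T W v|| with ||v|| = 1
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in z]
    return lam


def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2; always between 1 and rank(W)."""
    top = spectral_norm_sq(W)
    if top == 0.0:
        raise ValueError("stable rank is undefined for the zero matrix")
    return frobenius_norm_sq(W) / top
```

A rank-one matrix has stable rank 1, while an n-by-n identity matrix has stable rank n, so lower values indicate that a few directions dominate the matrix.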

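This paper examines probabilistic Lipschitzness in explainability models and the connection between the stable rank and neural networks. It proposes a new metric, normalised astuteness, for comparing the robustness of explainability models, and shows a link between the stable rank and that robustness.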

Probabilistic Lipschitzness and the Stable Rank for Comparing Explanation Models

Large Language Models (LLMs) have demonstrated remarkable capabilities in
performing complex tasks. Moreover, recent research has shown that
incorporating human-annotated rationales (e.g., Chain-of-Thought prompting)
during in-context learning can significantly enhance the performance of these
models, particularly on tasks that require reasoning capabilities. However,
incorporating such rationales poses challenges in terms of scalability as this
requires a high degree of human involvement. In this work, we present a novel
framework, Amplifying Model Performance by Leveraging In-Context Learning with
Post Hoc Explanations (AMPLIFY), which addresses the aforementioned challenges
by automating the process of rationale generation. To this end, we leverage
post hoc explanation methods which output attribution scores (explanations)
capturing the influence of each of the input features on model predictions.
More specifically, we construct automated natural language rationales that
embed insights from post hoc explanations to provide corrective signals to
LLMs. Extensive experimentation with real-world datasets demonstrates that our
framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25%
over a wide range of tasks, including those where prior approaches which rely
on human-annotated rationales such as Chain-of-Thought prompting fall short.
Our work makes one of the first attempts at highlighting the potential of post
hoc explanations as valuable tools for enhancing the effectiveness of LLMs.
Furthermore, we conduct additional empirical analyses and ablation studies to
demonstrate the impact of each of the components of AMPLIFY, which, in turn,
lead to critical insights for refining in-context learning.
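The idea of turning attribution scores into natural language rationales can be sketched as follows. This is a minimal illustration, not the AMPLIFY implementation: the function names, the choice of the top-k highest-scoring tokens, and the wording of the rationale template are all assumptions made for the example.

```python
def top_k_features(tokens, scores, k=3):
    """Return the k tokens with the highest attribution scores."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked[:k]]


def build_rationale(tokens, scores, label, k=3):
    """Format the top-k tokens as a rationale sentence (hypothetical template)."""
    keywords = ", ".join(top_k_features(tokens, scores, k))
    return f"The key words: {keywords} are important clues to predict {label}."


def build_prompt(demonstrations, query):
    """Assemble a few-shot prompt from (text, rationale, label) triples."""
    parts = []
    for text, rationale, label in demonstrations:
        parts.append(f"Input: {text}\nRationale: {rationale}\nLabel: {label}\n")
    parts.append(f"Input: {query}\nRationale:")
    return "\n".join(parts)


# Example with made-up attribution scores for a sentiment task.
tokens = ["the", "movie", "was", "truly", "wonderful"]
scores = [0.01, 0.20, 0.02, 0.40, 0.90]
rationale = build_rationale(tokens, scores, label="positive", k=2)
prompt = build_prompt(
    [("the movie was truly wonderful", rationale, "positive")],
    "the plot was dull",
)
```

In this sketch the attribution scores are supplied directly; in practice they would come from a post hoc explanation method applied to a model, which replaces the human annotator in the rationale-writing step.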

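The AMPLIFY framework uses post hoc explanation methods to automatically generate natural language rationales that provide corrective signals, improving the prediction accuracy of Large Language Models.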

Post Hoc Explanations of Language Models Can Improve Language Models

An increasing number of machine learning models have been deployed in
high-stakes domains such as finance and healthcare. Despite their superior
performance, many of these models are black boxes that are hard to explain.
Researchers are making growing efforts to develop methods for interpreting
these black-box models. Post hoc explanations based on perturbations, such as LIME,
are widely used approaches to interpret a machine learning model after it has
been built. This class of methods has been shown to exhibit large instability,
posing serious challenges to the effectiveness of the method itself and harming
user trust. In this paper, we propose S-LIME, which utilizes a hypothesis
testing framework based on the central limit theorem for determining the number
of perturbation points needed to guarantee stability of the resulting
explanation. Experiments on both simulated and real-world data sets are
provided to demonstrate the effectiveness of our method.
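To make the role of the central limit theorem concrete, the sketch below shows a generic sample-size calculation for a one-sided z-test. It is not the statistic used by S-LIME, which the abstract does not specify. It assumes a per-sample quantity (for example, a difference between two candidate features' contributions) whose mean and standard deviation were estimated from an initial batch of perturbation points; the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def required_samples(mean_diff, std_diff, alpha=0.05):
    """Smallest n at which a one-sided z-test at level alpha rejects
    'mean difference <= 0', assuming the estimated mean and standard
    deviation stay the same as more perturbation points are drawn.

    By the central limit theorem, sqrt(n) * mean / std is approximately
    standard normal under the null, so we need
    n >= (z_alpha * std / mean) ** 2.
    """
    if mean_diff <= 0 or std_diff <= 0:
        raise ValueError("mean_diff and std_diff must be positive")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return math.ceil((z_alpha * std_diff / mean_diff) ** 2)
```

The required number of points grows with the square of the noise-to-signal ratio, so halving the observed difference roughly quadruples the perturbation budget.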

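This work studies explanation methods for black-box machine learning models and proposes S-LIME, a hypothesis testing framework based on the central limit theorem that guarantees the stability of the resulting explanations. Experiments on simulated and real-world data sets demonstrate the method's effectiveness.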

S-LIME: Stabilized-LIME for Model Explanation

As machine learning black boxes are increasingly being deployed in real-world
applications, there has been a growing interest in developing post hoc
explanations that summarize the behaviors of these black boxes. However,
existing algorithms for generating such explanations have been shown to lack
stability and robustness to distribution shifts. We propose a novel framework
for generating robust and stable explanations of black box models based on
adversarial training. Our framework optimizes a minimax objective that aims to
construct the highest fidelity explanation with respect to the worst-case over
a set of adversarial perturbations. We instantiate this algorithm for
explanations in the form of linear models and decision sets by devising the
required optimization procedures. To the best of our knowledge, this work makes
the first attempt at generating post hoc explanations that are robust to a
general class of adversarial perturbations that are of practical interest.
Experimental evaluation with real-world and synthetic datasets demonstrates
that our approach substantially improves robustness of explanations without
sacrificing their fidelity on the original data distribution.
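One way to write a minimax objective of the shape described above is given below. The notation is an assumption introduced for illustration and is not taken from the paper: B is the black-box model, E an explanation drawn from a class \mathcal{E} (for example, linear models or decision sets), \Delta the set of admissible perturbations, p the original data distribution, and \ell a loss measuring disagreement between explanation and black box.

```latex
\[
  E^{\ast} \;=\; \arg\min_{E \in \mathcal{E}} \;
  \max_{\delta \in \Delta} \;
  \mathbb{E}_{x \sim p}\!\left[ \ell\big( E(x + \delta),\, B(x + \delta) \big) \right]
\]
```

The inner maximisation picks the perturbation under which the explanation is least faithful, and the outer minimisation chooses the explanation that is most faithful in that worst case.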

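Using adversarial training, we propose a new framework for generating robust, high-fidelity explanations of black-box models, addressing the lack of stability and robustness that existing algorithms show under distribution shifts. This work is the first attempt at generating post hoc explanations that are robust to a class of adversarial perturbations of practical interest. Experiments show that our approach substantially improves the robustness of explanations without sacrificing their fidelity on the original data distribution.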