Incorporating natural language rationales in the prompt and In-Context
Learning (ICL) has led to a significant improvement in Large Language Model
(LLM) performance. However, rationales currently require human annotation or
the use of auxiliary proxy models to target promising samples or generate
high-quality rationales. In this work, we propose Self-AMPLIFY to automatically
generate rationales from post hoc explanation methods applied to Small
Language Models (SLMs) to improve their own performance. Self-AMPLIFY is a
3-step method that targets samples, generates rationales and builds a final
prompt to leverage ICL. Self-AMPLIFY performance is evaluated on two SLMs and
two datasets requiring reasoning abilities: these experiments show that
Self-AMPLIFY achieves good results against competitors. Self-AMPLIFY is the
first method to apply post hoc explanation methods to SLMs to generate
rationales to improve their own performance in a fully automated manner.
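
The three steps named above (target samples, generate rationales, build the final prompt) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Example` fields, the rule of preferring misclassified samples, the top-k keyword rationale template, and all function names are assumptions made for this sketch, and the per-token attribution scores are assumed to have been computed beforehand by some post hoc explanation method.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    question: str
    answer: str                 # gold label
    predicted: str              # the SLM's own prediction (hypothetical field)
    tokens: List[str]           # tokens of the question
    attributions: List[float]   # one post hoc importance score per token (assumed precomputed)


def select_samples(pool: Sequence[Example], n_shots: int) -> List[Example]:
    """Step 1: target samples. Assumption: misclassified examples come first."""
    wrong = [ex for ex in pool if ex.predicted != ex.answer]
    right = [ex for ex in pool if ex.predicted == ex.answer]
    return (wrong + right)[:n_shots]


def top_k_rationale(ex: Example, k: int = 3) -> str:
    """Step 2: turn attribution scores into a short natural language rationale."""
    ranked = sorted(range(len(ex.tokens)),
                    key=lambda i: ex.attributions[i], reverse=True)[:k]
    # Restore the original word order so the rationale reads naturally.
    keywords = [ex.tokens[i] for i in sorted(ranked)]
    return "The key words are: " + ", ".join(keywords) + "."


def build_prompt(shots: Sequence[Example], query: str, k: int = 3) -> str:
    """Step 3: assemble the few-shot prompt used for in-context learning."""
    blocks = [
        f"Question: {ex.question}\nRationale: {top_k_rationale(ex, k)}\nAnswer: {ex.answer}"
        for ex in shots
    ]
    blocks.append(f"Question: {query}\nRationale:")
    return "\n\n".join(blocks)
```

In this sketch the same model would supply both the predictions and the attribution scores, which is what makes the loop "self" improving; any attribution method that yields one score per token could be plugged into step 2.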

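Self-AMPLIFY is an automated method that applies post hoc explanation methods to small language models to generate rationales and improve their performance.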

Self-AMPLIFY: A Self-Explanation Method for Improving the Performance of Small Language Models

Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations

With the increased deployment of machine learning models in various
real-world applications, researchers and practitioners alike have emphasized
the need for explanations of model behaviour. To this end, two broad strategies
have been outlined in prior literature to explain models. Post hoc explanation
methods explain the behaviour of complex black-box models by highlighting
features that are critical to model predictions; however, prior work has shown
that these explanations may not be faithful, and even more concerning is our
inability to verify them. Specifically, it is nontrivial to evaluate if a given
attribution is correct with respect to the underlying model. Inherently
interpretable models, on the other hand, circumvent these issues by explicitly
encoding explanations into model architecture, meaning their explanations are
naturally faithful and verifiable, but they often exhibit poor predictive
performance due to their limited expressive power. In this work, we aim to
bridge the gap between the aforementioned strategies by proposing Verifiability
Tuning (VerT), a method that transforms black-box models into models that
naturally yield faithful and verifiable feature attributions. We begin by
introducing a formal theoretical framework to understand verifiability and show
that attributions produced by standard models cannot be verified. We then
leverage this framework to propose a method to build verifiable models and
feature attributions out of fully trained black-box models. Finally, we perform
extensive experiments on semi-synthetic and real-world datasets, and show that
VerT produces models that (1) yield explanations that are correct and
verifiable and (2) are faithful to the original black-box models they are meant
to explain.

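VerT transforms black-box models into models that yield faithful and verifiable feature attributions, bridging the gap between the explanation strategies outlined in prior work.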

Verifiable Feature Attributions: Bridging Post Hoc Explainability and Intrinsic Interpretability

Verifiable Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability

While several types of post hoc explanation methods (e.g., feature
attribution methods) have been proposed in recent literature, there is little
to no work on systematically benchmarking these methods in an efficient and
transparent manner. Here, we introduce OpenXAI, a comprehensive and extensible
open-source framework for evaluating and benchmarking post hoc explanation
methods. OpenXAI comprises the following key components: (i) a flexible
synthetic data generator and a collection of diverse real-world datasets,
pre-trained models, and state-of-the-art feature attribution methods, (ii)
open-source implementations of twenty-two quantitative metrics for evaluating
faithfulness, stability (robustness), and fairness of explanation methods, and
(iii) the first ever public XAI leaderboards to benchmark explanations. OpenXAI
is easily extensible, as users can readily evaluate custom explanation methods
and incorporate them into our leaderboards. Overall, OpenXAI provides an
automated end-to-end pipeline that not only simplifies and standardizes the
evaluation of post hoc explanation methods, but also promotes transparency and
reproducibility in benchmarking these methods. OpenXAI datasets and data
loaders, implementations of state-of-the-art explanation methods and evaluation
metrics, as well as leaderboards are publicly available at
this https URL.

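This work introduces OpenXAI, a comprehensive and extensible open-source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI includes a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods; open-source implementations of 22 quantitative metrics for evaluating the faithfulness, stability, and fairness of explanation methods; and public XAI leaderboards for benchmarking explanation methods.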

OpenXAI: Towards Transparent Evaluation of Machine Learning Model Explanations

OpenXAI: Towards a Transparent Evaluation of Model Explanations

As various post hoc explanation methods are increasingly being leveraged to
explain complex models in high-stakes settings, it becomes critical to develop
a deeper understanding of if and when the explanations output by these methods
disagree with each other, and how such disagreements are resolved in practice.
However, there is little to no research that provides answers to these critical
questions. In this work, we introduce and study the disagreement problem in
explainable machine learning. More specifically, we formalize the notion of
disagreement between explanations, analyze how often such disagreements occur
in practice, and study how practitioners resolve these disagreements. To this end,
we first conduct interviews with data scientists to understand what constitutes
disagreement between explanations generated by different methods for the same
model prediction, and introduce a novel quantitative framework to formalize
this understanding. We then leverage this framework to carry out a rigorous
empirical analysis with four real-world datasets, six state-of-the-art post hoc
explanation methods, and eight different predictive models, to measure the
extent of disagreement between the explanations generated by various popular
explanation methods. In addition, we carry out an online user study with data
scientists to understand how they resolve the aforementioned disagreements. Our
results indicate that state-of-the-art explanation methods often disagree in
terms of the explanations they output. Our findings also underscore the
importance of developing principled evaluation metrics that enable
practitioners to effectively compare explanations.
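
To make the idea of measuring disagreement concrete, one simple way to compare two feature attribution explanations of the same prediction is the fraction of top-k features they share. The sketch below is an illustrative assumption: the abstract does not specify the metrics in the authors' framework, and the function names and the use of absolute attribution values for ranking are choices made here.

```python
from typing import Sequence, Set


def top_k_features(attributions: Sequence[float], k: int) -> Set[int]:
    """Indices of the k features with the largest absolute attribution."""
    ranked = sorted(range(len(attributions)),
                    key=lambda i: abs(attributions[i]), reverse=True)
    return set(ranked[:k])


def feature_agreement(a: Sequence[float], b: Sequence[float], k: int) -> float:
    """Fraction of top-k features shared by two explanations (1.0 = full agreement)."""
    if len(a) != len(b):
        raise ValueError("explanations must cover the same features")
    if not 1 <= k <= len(a):
        raise ValueError("k must be between 1 and the number of features")
    return len(top_k_features(a, k) & top_k_features(b, k)) / k
```

For example, with hypothetical attributions `[0.9, -0.5, 0.1, 0.0]` and `[0.2, 0.8, -0.7, 0.05]`, the top-2 sets are `{0, 1}` and `{1, 2}`, so the agreement at k=2 is 0.5. A score like this ignores the sign and relative order of the attributions, so it captures only one facet of disagreement.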

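By studying the disagreement problem in explainable machine learning, this work introduces a quantitative framework that formalizes disagreement between explanations generated by different methods, and uses an empirical analysis and an online user study to understand how data scientists resolve these disagreements. The results show that state-of-the-art explanation methods often disagree in the explanations they output, underscoring the importance of developing principled evaluation metrics that enable effective comparison of explanations.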