Ownership verification is currently the most critical and widely adopted
post-hoc method to safeguard model copyright. In general, model owners exploit
it to identify whether a given suspicious third-party model is stolen from them
by examining whether it has particular properties `inherited' from their
released models. Currently, backdoor-based model watermarks are the primary and
cutting-edge methods to implant such properties in the released models.
However, backdoor-based methods have two fatal drawbacks: harmfulness and
ambiguity. The former means that they introduce maliciously controllable
misclassification behaviors (i.e., backdoors) into the watermarked released
models. The latter means that malicious users can easily pass the
verification by finding other misclassified samples, leading to ownership
ambiguity.
In this paper, we argue that both limitations stem from the `zero-bit' nature
of existing watermarking schemes, which exploit the status (i.e.,
misclassified) of predictions for verification. Motivated by this
understanding, we design a new watermarking paradigm, namely Explanation as a
Watermark (EaaW), that implants verification behaviors into the explanation of
feature attribution instead of model predictions. Specifically, EaaW embeds a
`multi-bit' watermark into the feature attribution explanation of specific
trigger samples without changing the original prediction. We correspondingly
design the watermark embedding and extraction algorithms inspired by
explainable artificial intelligence. Notably, our approach can be used
for different tasks (e.g., image classification and text generation).
Extensive experiments verify the effectiveness and harmlessness of our EaaW and
its resistance to potential attacks.
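
The idea of reading a multi-bit watermark out of a feature attribution, rather than out of a prediction, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy linear model, the occlusion-style attribution, the sign-based binarization, and the names `occlusion_attribution`, `extract_bits`, and `bit_accuracy` are hypothetical and are not taken from the EaaW paper, whose embedding and extraction algorithms are not described in the abstract.

```python
from typing import Callable, List, Sequence


def occlusion_attribution(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: float = 0.0,
) -> List[float]:
    """Attribution of feature i = model(x) - model(x with feature i set to baseline)."""
    full = model(x)
    scores = []
    for i in range(len(x)):
        masked = list(x)
        masked[i] = baseline  # occlude a single feature
        scores.append(full - model(masked))
    return scores


def extract_bits(attributions: Sequence[float]) -> List[int]:
    """Binarize attributions by sign (assumed rule): positive -> 1, otherwise 0."""
    return [1 if a > 0 else 0 for a in attributions]


def bit_accuracy(extracted: Sequence[int], watermark: Sequence[int]) -> float:
    """Fraction of extracted bits that match the owner's watermark."""
    matches = sum(e == w for e, w in zip(extracted, watermark))
    return matches / len(watermark)


# Hypothetical toy model: a linear scorer whose weight signs happen to
# encode the watermark on the trigger sample.
weights = [0.8, -0.5, 0.3, -0.9]


def toy_model(x: Sequence[float]) -> float:
    return sum(w * v for w, v in zip(weights, x))


trigger = [1.0, 1.0, 1.0, 1.0]  # hypothetical trigger sample
watermark = [1, 0, 1, 0]        # hypothetical owner watermark

bits = extract_bits(occlusion_attribution(toy_model, trigger))
print(bits, bit_accuracy(bits, watermark))
```

In this sketch the verifier compares a whole bit string rather than a single misclassification event, which is the sense in which such a watermark carries multiple bits; the model's prediction on the trigger sample is never inspected.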

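Ownership verification and watermarking are central to protecting model copyright; current post-hoc methods identify whether a suspicious third-party model was stolen by checking whether it has particular properties. This paper proposes a new watermarking technique based on explainable artificial intelligence that addresses the limitations of existing methods by embedding verification behaviors into feature attribution explanations.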

Harmless and Multi-bit Model Ownership Verification by Watermarking Feature Attribution

Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Watermarking Feature Attribution

Fairness in machine learning (ML) has received much attention. However,
existing studies have mainly focused on the distributive fairness of ML models.
The other dimension of fairness, i.e., procedural fairness, has been neglected.
In this paper, we first define the procedural fairness of ML models, and then
give formal definitions of individual and group procedural fairness. We propose
a novel metric to evaluate the group procedural fairness of ML models, called
$GPF_{FAE}$, which utilizes a widely used explainable artificial intelligence
technique, namely feature attribution explanation (FAE), to capture the
decision process of the ML models. We validate the effectiveness of $GPF_{FAE}$
on a synthetic dataset and eight real-world datasets. Our experiments reveal
the relationship between procedural and distributive fairness of the ML model.
Based on our analysis, we propose a method for identifying the features that
lead to the procedural unfairness of the model and propose two methods to
improve procedural fairness after identifying unfair features. Our experimental
results demonstrate that we can accurately identify the features that lead to
procedural unfairness in the ML model, and both of our proposed methods can
significantly improve procedural fairness with a slight impact on model
performance, while also improving distributive fairness.
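
To make the idea of comparing decision processes across groups concrete, the sketch below computes a per-feature gap between the mean feature attributions of two groups. This is a hypothetical proxy and not the definition of $GPF_{FAE}$, which the abstract does not spell out. It assumes a linear model, for which weight times input is a standard attribution relative to a zero baseline; the function names and the toy data are invented for illustration.

```python
from statistics import mean
from typing import List, Sequence


def linear_attributions(weights: Sequence[float], x: Sequence[float]) -> List[float]:
    """Attribution of each feature for a linear model: weight * input value."""
    return [w * v for w, v in zip(weights, x)]


def group_mean_attributions(
    weights: Sequence[float], group: Sequence[Sequence[float]]
) -> List[float]:
    """Average attribution of each feature over all individuals in a group."""
    rows = [linear_attributions(weights, x) for x in group]
    return [mean(col) for col in zip(*rows)]


def per_feature_gap(
    weights: Sequence[float],
    group_a: Sequence[Sequence[float]],
    group_b: Sequence[Sequence[float]],
) -> List[float]:
    """Absolute difference of group-mean attributions (assumed proxy, not GPF_FAE)."""
    mean_a = group_mean_attributions(weights, group_a)
    mean_b = group_mean_attributions(weights, group_b)
    return [abs(a - b) for a, b in zip(mean_a, mean_b)]


# Hypothetical data: feature 0 is distributed identically in both groups,
# feature 1 acts as a proxy for group membership.
weights = [0.5, 2.0]
group_a = [[1.0, 0.0], [3.0, 0.0]]
group_b = [[1.0, 1.0], [3.0, 1.0]]

gaps = per_feature_gap(weights, group_a, group_b)
most_unfair = max(range(len(gaps)), key=gaps.__getitem__)
print(gaps, most_unfair)
```

Under these assumptions, the feature with the largest gap is a candidate source of procedural unfairness, mirroring in spirit the feature-identification step the abstract mentions.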

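Fairness in machine learning has attracted wide attention, yet existing studies mainly focus on the distributive fairness of models and neglect procedural fairness. This paper first defines the procedural fairness of machine learning models, then gives formal definitions of individual and group procedural fairness, and proposes a new metric for evaluating the group procedural fairness of machine learning models, $GPF_{FAE}$, which uses feature attribution explanation, an explainable artificial intelligence technique, to capture the model's decision process. We validate the effectiveness of $GPF_{FAE}$ on a synthetic dataset and eight real-world datasets. The experimental results reveal the relationship between the procedural and distributive fairness of the model. Based on our analysis, we propose a method for identifying the features that lead to procedural unfairness of the model, as well as two methods for improving procedural fairness. Our experiments show that we can accurately identify the features that lead to procedural unfairness, and that both proposed methods significantly improve procedural and distributive fairness with only a slight impact on model performance.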

Procedural Justice in Machine Learning

Procedural Fairness in Machine Learning

Large language models (LLMs) such as ChatGPT have demonstrated superior
performance on a variety of natural language processing (NLP) tasks including
sentiment analysis, mathematical reasoning and summarization. Furthermore,
since these models are instruction-tuned on human conversations to produce
"helpful" responses, they can and often will produce explanations along with
the response, which we call self-explanations. For example, when analyzing the
sentiment of a movie review, the model may output not only the positivity of
the sentiment, but also an explanation (e.g., by listing the sentiment-laden
words such as "fantastic" and "memorable" in the review). How good are these
automatically generated self-explanations? In this paper, we investigate this
question on the task of sentiment analysis and for feature attribution
explanation, one of the most commonly studied settings in the interpretability
literature (for pre-ChatGPT models). Specifically, we study different ways to
elicit the self-explanations, evaluate their faithfulness on a set of
evaluation metrics, and compare them to traditional explanation methods such as
occlusion or LIME saliency maps. Through an extensive set of experiments, we
find that ChatGPT's self-explanations perform on par with traditional ones, but
are quite different from them according to various agreement metrics, while
being much cheaper to produce (as they are generated along with the
prediction). In addition, we identify several interesting characteristics of
these self-explanations, which prompt us to rethink many current model
interpretability practices in the era of ChatGPT(-like) LLMs.
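
As a rough illustration of the comparison described above, the sketch below computes occlusion saliency for a toy sentiment scorer and measures its top-k agreement with a list of words that a model might name in a self-explanation. The lexicon-based scorer, the hard-coded self-explanation, and the `top_k_overlap` agreement measure are assumptions made for illustration; they are not the models or metrics used in the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical sentiment lexicon for the toy scorer.
POSITIVE = {"fantastic", "memorable"}
NEGATIVE = {"boring", "dull"}


def toy_sentiment_score(tokens: Sequence[str]) -> int:
    """Count positive words minus negative words (stand-in for a real classifier)."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def occlusion_saliency(
    score_fn: Callable[[Sequence[str]], float], tokens: List[str]
) -> List[float]:
    """Saliency of token i = score(full input) - score(input with token i removed)."""
    full = score_fn(tokens)
    return [full - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def top_k_overlap(
    saliency: Sequence[float],
    tokens: Sequence[str],
    self_explained: Sequence[str],
    k: int,
) -> float:
    """Share of the k most salient tokens (by magnitude) also named in the self-explanation."""
    ranked = sorted(range(len(tokens)), key=lambda i: -abs(saliency[i]))[:k]
    top = {tokens[i] for i in ranked}
    return len(top & set(self_explained)) / k


tokens = "a fantastic and memorable but boring film".split()
saliency = occlusion_saliency(toy_sentiment_score, tokens)

# Hypothetical self-explanation: words the model claims drove its answer.
self_explained = ["fantastic", "memorable"]
print(saliency, top_k_overlap(saliency, tokens, self_explained, k=2))
```

Note the cost asymmetry the abstract points to: occlusion needs one extra model call per token, whereas a self-explanation arrives together with the prediction.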

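ChatGPT's self-explanations perform on par with traditional methods at lower cost and exhibit many interesting characteristics, prompting us to rethink current model interpretability practices in the era of ChatGPT(-like) LLMs.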