The adoption of artificial intelligence (AI) across industries has led to the
widespread use of complex black-box models and interpretation tools for
decision making. This paper proposes an adversarial framework to uncover the
vulnerability of permutation-based interpretation methods for machine learning
tasks, with a particular focus on partial dependence (PD) plots. This
adversarial framework modifies the original black-box model to manipulate its
predictions for instances in the extrapolation domain. As a result, it produces
deceptive PD plots that can conceal discriminatory behaviors while preserving
most of the original model's predictions. A single modified model can produce
multiple deceptive PD plots. Using real-world datasets, including an auto
insurance claims dataset and the COMPAS (Correctional Offender Management
Profiling for Alternative Sanctions) dataset, we show that it is possible to
intentionally hide the discriminatory behavior of a predictor and make the
black-box model appear neutral through interpretation tools such as PD plots
while retaining almost all the predictions of the original black-box model.
Managerial insights for regulators and practitioners are provided based on the
findings.
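
To make the role of the extrapolation domain concrete, the following is a minimal sketch of how a PD curve is typically computed. It is an illustration only, not the paper's framework: the function `partial_dependence`, the toy predictor `toy_predict`, and the three-row dataset are all assumptions introduced here. The point it shows is that PD overwrites one feature for every row, so the model is also queried on feature combinations that never occur in the data.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        pd_values.append(predict(X_mod).mean())
    return np.array(pd_values)

# Hypothetical toy "black box": linear in both features.
def toy_predict(X):
    return 2.0 * X[:, 0] + 0.5 * X[:, 1]

# Hypothetical data with perfectly correlated features: only (0,0), (1,1), (2,2) occur.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
grid = np.array([0.0, 2.0])

# Evaluating the grid forces rows such as (0, 2) and (2, 0), which are absent from X.
pd_curve = partial_dependence(toy_predict, X, feature=0, grid=grid)
print(pd_curve)
```

In this toy setting, the rows (0, 2) and (2, 0) lie outside the observed data. A model that changed its output only on such rows would alter the PD curve while leaving its predictions on the observed rows untouched, which is the kind of gap the abstract describes.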

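This paper proposes an adversarial framework that exposes the vulnerability of permutation-based interpretation methods for machine learning tasks, with a particular focus on partial dependence (PD) plots. By modifying the original black-box model to manipulate its predictions for instances in the extrapolation domain, the framework produces deceptive PD plots that can conceal discriminatory behavior while preserving most of the original model's predictions, so that the black-box model appears neutral under interpretation tools such as PD plots. The results, validated on real-world datasets, show that a predictor's discriminatory behavior can be intentionally hidden, and the paper offers managerial insights for regulators and practitioners.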

Why Interpretability in Machine Learning Cannot Be Trusted: Adversarial Attacks on Partial Dependence Plots

Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots

Recent research has identified discriminatory behavior of automated
prediction algorithms toward groups defined by specific protected
attributes (e.g., gender, ethnicity, or age group). When deployed in
real-world scenarios, such algorithms may produce biased predictions
resulting in unfair outcomes. Recent work mitigates such biased behavior
mostly by adding convex surrogates of fairness metrics, such as demographic
parity or equalized odds, to the loss function, which are often not easy to
estimate. This research proposes a novel in-processing-based GroupMixNorm
layer for mitigating bias in deep learning models. The GroupMixNorm layer
probabilistically mixes group-level feature
statistics of samples across different groups based on the protected attribute.
The proposed method improves upon several fairness metrics with minimal impact
on overall accuracy. Analysis on benchmark tabular and image datasets
demonstrates the efficacy of the proposed method in achieving state-of-the-art
performance. Further, the experimental analysis also suggests the robustness of
the GroupMixNorm layer against new protected attributes during inference and
its utility in eliminating bias from a pre-trained network.
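
The abstract does not spell out the layer's formulation, so the following is only a rough sketch of what mixing group-level feature statistics could look like, written in NumPy rather than as a trainable layer. Everything specific here is an assumption: the function name `group_mix_norm`, the use of per-group mean and standard deviation as the statistics, the random pairing of groups, and the Beta-distributed mixing weight with parameter `alpha`.

```python
import numpy as np

def group_mix_norm(features, groups, alpha=0.5, eps=1e-6, rng=None):
    """Re-normalize each group's features with statistics mixed across groups.

    Assumed formulation (not taken from the paper): per-group mean/std,
    a random partner group, and a Beta(alpha, alpha) mixing weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)

    # Group-level statistics, computed per feature dimension.
    means = {g: features[groups == g].mean(axis=0) for g in labels}
    stds = {g: features[groups == g].std(axis=0) + eps for g in labels}

    partners = rng.permutation(labels)  # random partner group for each group
    lam = rng.beta(alpha, alpha)        # probabilistic mixing weight in [0, 1]

    out = np.empty_like(features)
    for g, p in zip(labels, partners):
        mask = groups == g
        mixed_mean = lam * means[g] + (1.0 - lam) * means[p]
        mixed_std = lam * stds[g] + (1.0 - lam) * stds[p]
        # Standardize with the group's own statistics, then rescale with the mixed ones.
        out[mask] = (features[mask] - means[g]) / stds[g] * mixed_std + mixed_mean
    return out

# Hypothetical mini-batch: four samples, two features, two protected groups.
feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0], [7.0, 1.0]])
protected = np.array([0, 0, 1, 1])
mixed = group_mix_norm(feats, protected, rng=np.random.default_rng(0))
```

When a group is paired with itself, the mixed statistics equal its own and the features pass through unchanged, so under these assumptions the operation only perturbs features when statistics from a different group are blended in.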

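The paper proposes a method based on a GroupMixNorm layer that mitigates bias in deep learning models by mixing group-level feature statistics of samples, reducing bias with only a small impact on overall accuracy. Experiments show that the method achieves state-of-the-art performance on benchmark datasets, and also indicate the GroupMixNorm layer's robustness to new protected attributes during inference and its utility in eliminating bias from a pre-trained network.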