The massive deployment of Machine Learning (ML) models has been accompanied
by the emergence of several attacks that threaten their trustworthiness and
raise ethical and societal concerns such as invasion of privacy, discrimination
risks, and lack of accountability. Model hijacking is one of these attacks,
where the adversary aims to hijack a victim model to execute a different task
than its original one. Model hijacking can cause accountability and security
risks, since the owner of a hijacked model can be framed for offering illegal
or unethical services through it. Prior state-of-the-art work treats
model hijacking as a training-time attack, in which the adversary requires access
to the training of the ML model to execute the attack. In this paper, we consider a
stronger threat model where the attacker has no access to the training phase of
the victim model. Our intuition is that ML models, which are typically
over-parameterized, might (unintentionally) learn more than the task they
are trained for. We propose SnatchML, a simple approach to model hijacking at
inference time: it classifies unknown input samples by their distance, in the
latent space of the victim model, to previously known samples
associated with the hijacking task classes. SnatchML empirically shows that
benign pre-trained models can execute tasks that are semantically related to
the initial task. Surprisingly, this can be true even for hijacking tasks
unrelated to the original task. We also explore different methods to mitigate
this risk. We first propose a novel approach we call meta-unlearning, designed
to help the model unlearn a potentially malicious task while training on the
original task dataset. We also provide insights on over-parameterization as one
possible inherent factor that makes model hijacking easier, and we accordingly
propose a compression-based countermeasure against this attack.
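
The distance-based classification described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distance measure or aggregation SnatchML uses, so the nearest-centroid rule and the Euclidean distance here are assumptions, and the small arrays stand in for latent vectors that would in practice come from the victim model. The function names `class_centroids` and `classify_by_distance` are hypothetical.

```python
import numpy as np

def class_centroids(features, labels):
    """Average the latent vectors of the known samples of each hijacking class."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify_by_distance(query_features, classes, centroids):
    """Assign each query to the class with the nearest centroid (Euclidean, assumed)."""
    # dists[i, j] is the distance between query i and the centroid of class j.
    diffs = query_features[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy stand-ins for latent vectors extracted from the victim model (assumption).
known_feats = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
known_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[0.1, 0.2], [0.8, 1.0]])

classes, centroids = class_centroids(known_feats, known_labels)
predictions = classify_by_distance(query_feats, classes, centroids)
print(predictions)  # hijacking-task labels for the two query samples
```

The point of the sketch is that nothing in the victim model is modified or retrained: the attacker only needs its latent outputs for a few labeled samples of the hijacking task.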

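We propose SnatchML, a simple approach to model hijacking at inference time that classifies unknown input samples by using distance measures, in the latent space of the victim model, to previously known samples associated with the hijacking task classes. We also explore different methods to mitigate this risk, including a novel approach called meta-unlearning, which helps the model unlearn a potentially malicious task while training on the original task dataset, and a compression-based countermeasure against this attack.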

Model for Peanuts: Hijacking ML Models without Training Access is Possible

Federated Learning (FL) has been gaining popularity as a collaborative
learning framework to train deep learning-based object detection models over a
distributed population of clients. Despite its advantages, FL is vulnerable to
model hijacking. The attacker can control how the object detection system
should misbehave by implanting Trojaned gradients using only a small number of
compromised clients in the collaborative learning process. This paper
introduces STDLens, a principled approach to safeguarding FL against such
attacks. We first investigate existing mitigation mechanisms and analyze their
failures caused by the inherent errors in spatial clustering analysis on
gradients. Based on these insights, we introduce a three-tier forensic framework
to identify and expel Trojaned gradients and recover performance over the
course of FL. We consider three types of adaptive attacks and demonstrate the
robustness of STDLens against advanced adversaries. Extensive experiments show
that STDLens can protect FL against different model hijacking attacks and
outperform existing methods in identifying and removing Trojaned gradients with
significantly higher precision and much lower false-positive rates.

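This paper proposes STDLens, a three-tier forensic framework that protects Federated Learning (FL) against model hijacking attacks. By identifying and expelling Trojaned gradients, STDLens defends against different model hijacking attacks and outperforms existing methods at identifying and removing Trojaned gradients, with higher precision and lower false-positive rates.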