Because training data is often unavailable or very large, and because training
machine learning models is costly in both computation and human labor, it is
common practice to rely on open-source pre-trained models whenever possible.
However, this practice is worrisome from a security perspective. Pre-trained models can be
infected with Trojan attacks, in which the attacker embeds a trigger in the
model such that the model's behavior can be controlled by the attacker when the
trigger is present in the input. In this paper, we present our preliminary work
on a novel method for Trojan model detection. Our method creates a signature
for a model based on activation optimization. A classifier is then trained to
detect a Trojan model given its signature. Our method achieves state-of-the-art
performance on two public datasets.
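
The abstract gives no detail on how the signature is built. One plausible reading, offered here only as an assumption, is activation maximization: for each output class, search for an input that maximizes that class's activation, then collect the model's outputs on those inputs into a fixed-length vector. The sketch below does this for a toy linear model using finite-difference gradient ascent. The names `toy_logits`, `maximize_activation`, and `model_signature`, the clamping of inputs to [-1, 1], and the toy weights are all hypothetical, and the actual method may differ substantially.

```python
from typing import Callable, List

Vector = List[float]
Model = Callable[[Vector], Vector]


def maximize_activation(model: Model, class_idx: int, dim: int,
                        steps: int = 50, lr: float = 0.1,
                        eps: float = 1e-4) -> Vector:
    """Gradient ascent on the input to maximize one output activation.

    Gradients are estimated with central differences, so the model is
    treated as a plain function. Inputs are clamped to [-1, 1] (an
    assumption made here to keep the search bounded).
    """
    x = [0.0] * dim
    for _ in range(steps):
        grad = []
        for i in range(dim):
            plus = x[:]
            plus[i] += eps
            minus = x[:]
            minus[i] -= eps
            diff = model(plus)[class_idx] - model(minus)[class_idx]
            grad.append(diff / (2 * eps))
        x = [min(1.0, max(-1.0, xi + lr * g)) for xi, g in zip(x, grad)]
    return x


def model_signature(model: Model, num_classes: int, dim: int) -> Vector:
    """Concatenate the model's outputs on each class-maximizing input."""
    signature: Vector = []
    for c in range(num_classes):
        x_opt = maximize_activation(model, c, dim)
        signature.extend(model(x_opt))
    return signature


# Toy stand-in for a pre-trained model: two classes, three input features.
WEIGHTS = [[1.0, -2.0, 0.5], [-1.0, 0.5, 2.0]]


def toy_logits(x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]


sig = model_signature(toy_logits, num_classes=2, dim=3)
print(sig)
```

In a detection setting, one such vector would be computed per model and passed to a separately trained classifier; that classifier is not shown here.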

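This paper proposes a signature-extraction method for pre-trained machine learning models based on activation optimization, and trains a classifier on those signatures to detect Trojan models. The method achieves state-of-the-art performance on two public datasets.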

Trojan Model Detection Using Activation Optimization

Over the past years, deep generative models have achieved a new level of
performance. Generated data has become difficult, if not impossible, to
distinguish from real data. While there are plenty of use cases that benefit
from this technology, there are also strong concerns about how this new technology
can be misused to generate deep fakes and enable misinformation at scale.
Unfortunately, current deep fake detection methods are not sustainable, as the
gap between real and fake continues to close. In contrast, our work enables the
responsible disclosure of such state-of-the-art generative models by allowing
model inventors to fingerprint their models, so that the generated samples
containing a fingerprint can be accurately detected and attributed to a source.
Our technique achieves this by an efficient and scalable ad-hoc generation of a
large population of models with distinct fingerprints. Our recommended
operation point uses a 128-bit fingerprint which in principle results in more
than $10^{38}$ identifiable models. Experiments show that our method fulfills
key properties of a fingerprinting mechanism and achieves effectiveness in deep
fake detection and attribution. Code and models are available at
this https URL .
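
The capacity figure follows from simple counting: a binary fingerprint of k bits can take 2^k distinct values. The short check below confirms that 128 bits gives more than 10^38 values. The function name `fingerprint_capacity` is hypothetical and is not taken from the authors' code. The result is only an upper bound on the number of identifiable models, as the "in principle" in the abstract indicates.

```python
def fingerprint_capacity(num_bits: int) -> int:
    """Number of distinct binary fingerprints of the given length."""
    if num_bits < 0:
        raise ValueError("num_bits must be non-negative")
    return 2 ** num_bits


capacity = fingerprint_capacity(128)

# 2**128 is roughly 3.4e38, which exceeds the 10**38 quoted in the abstract.
print(capacity > 10 ** 38)
print(f"{capacity:.3e}")
```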

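The researchers develop a fingerprinting mechanism that detects and attributes the output of deep generative models, with the aim of preventing their use for creating deep fakes and spreading misinformation.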

Responsible Disclosure of Generative Models Using Scalable Fingerprinting

We want to detect whether a particular image dataset has been used to train a
model. We propose a new technique, \emph{radioactive data}, that makes
imperceptible changes to this dataset such that any model trained on it will
bear an identifiable mark. The mark is robust to strong variations such as
different architectures or optimization methods. Given a trained model, our
technique detects the use of radioactive data and provides a level of
confidence (p-value). Our experiments on large-scale benchmarks (ImageNet),
using standard architectures (ResNet-18, VGG-16, DenseNet-121) and training
procedures, show that we can detect usage of radioactive data with high
confidence ($p < 10^{-4}$) even when only 1% of the data used to train the model is
radioactive. Our method is robust to data augmentation and the stochasticity of
deep network optimization. As a result, it offers a much higher signal-to-noise
ratio than data poisoning and backdoor methods.
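
The abstract does not spell out the statistical test behind the reported p-value. As a purely illustrative sketch, and an assumption rather than the authors' procedure, suppose the mark is a secret direction `u` in a d-dimensional feature space and detection measures the cosine similarity between `u` and a direction extracted from the suspect model. Under the null hypothesis that the model never saw marked data, the extracted direction is unrelated to `u`, and for large d the cosine similarity is approximately normal with mean 0 and variance 1/d. The names `cosine_similarity` and `approx_p_value`, and the synthetic vectors, are hypothetical.

```python
import math
import random
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def approx_p_value(cos_sim: float, dim: int) -> float:
    """One-sided p-value using the N(0, 1/dim) large-dimension approximation."""
    z = cos_sim * math.sqrt(dim)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


rng = random.Random(0)
dim = 512
u = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # secret mark direction

# Direction from a model that never saw marked data: independent of u.
unrelated = [rng.gauss(0.0, 1.0) for _ in range(dim)]
# Direction from a model that picked up the mark: partly aligned with u.
marked = [0.5 * ui + rng.gauss(0.0, 1.0) for ui in u]

p_unrelated = approx_p_value(cosine_similarity(u, unrelated), dim)
p_marked = approx_p_value(cosine_similarity(u, marked), dim)
print(p_unrelated, p_marked)
```

A small p-value means the observed alignment would be very unlikely for a model trained without the marked data. An exact test would replace the normal approximation with the true distribution of the cosine similarity.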

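By marking a dataset with imperceptible changes, the radioactive data technique can detect whether that dataset was used to train a model. It offers a higher signal-to-noise ratio and is more robust.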

Radioactive data: tracing through training

In machine learning Trojan attacks, an adversary trains a corrupted model
that obtains good performance on normal data but behaves maliciously on data
samples with certain trigger patterns. Several approaches have been proposed to
detect such attacks, but they make undesirable assumptions about the attack
strategies or require direct access to the trained models, which restricts
their utility in practice.
This paper addresses these challenges by introducing a Meta Neural Trojan
Detection (MNTD) pipeline that does not make assumptions about the attack
strategies and only needs black-box access to models. The strategy is to train
a meta-classifier that predicts whether a given target model is Trojaned. To
train the meta-model without knowledge of the attack strategy, we introduce a
technique called jumbo learning that samples a set of Trojaned models following
a general distribution. We then dynamically optimize a query set together with
the meta-classifier to distinguish between Trojaned and benign models.
We evaluate MNTD with experiments on vision, speech, tabular, and natural
language text datasets, and against different Trojan attacks such as data
poisoning, model manipulation, and latent attacks. We show that
MNTD achieves a 97% detection AUC and significantly outperforms existing
detection approaches. In addition, MNTD generalizes well and achieves high
detection performance against unforeseen attacks. We also propose a robust MNTD
pipeline that achieves a 90% detection AUC even when the attacker aims to evade
detection with full knowledge of the system.
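
The black-box detection step described above can be sketched as follows: each query is fed to the target model, the outputs are collected into a feature vector, and the meta-classifier maps that vector to a Trojan score. This is a minimal sketch of the inference path only, under stated assumptions: the target model is any callable that returns class probabilities, the outputs are combined by concatenation, the meta-classifier is a plain logistic regression, and its weights are random placeholders rather than trained values. Jumbo learning and the joint optimization of the query set with the meta-classifier are omitted. All names (`model_representation`, `trojan_score`, `toy_model`) are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
BlackBoxModel = Callable[[Sequence[float]], Vector]


def model_representation(model: BlackBoxModel,
                         queries: Sequence[Sequence[float]]) -> Vector:
    """Concatenate the model's outputs on every query input."""
    features: Vector = []
    for query in queries:
        features.extend(model(query))
    return features


def trojan_score(features: Sequence[float], weights: Sequence[float],
                 bias: float) -> float:
    """Logistic meta-classifier: a score in (0, 1), higher means more suspicious."""
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def toy_model(x: Sequence[float]) -> Vector:
    """Stand-in target model: softmax over two fixed linear scores."""
    scores = [sum(x), -sum(x)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Placeholder query set: 3 queries with 4 features each (would be learned).
queries = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
features = model_representation(toy_model, queries)
# Placeholder meta-classifier weights (would be learned).
weights = [rng.uniform(-1.0, 1.0) for _ in features]
score = trojan_score(features, weights, bias=0.0)
print(len(features), score)
```

Because only `model(query)` is ever called, the detector needs no access to the target model's weights or gradients, which is what black-box access means here.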

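This paper proposes the Meta Neural Trojan Detection (MNTD) pipeline to address the challenges of detecting Trojan attacks on machine learning models. It trains a meta-classifier that predicts whether a target model is Trojaned using only black-box access, and introduces jumbo learning, which samples a set of Trojaned models from a general distribution to train the meta-classifier. Experiments and comparisons show that MNTD achieves a 97% detection AUC and outperforms existing detection methods.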