Diffusion models have achieved great success in a range of tasks, such as image synthesis and molecule design. As such successes hinge on large-scale training data collected from diverse sources, the trustworthiness of these collected data is hard to control or audit. In this work, we aim to explore the vulnerabilities of diffusion models under potential training data manipulations and try to answer: How hard is it to perform Trojan attacks on well-trained diffusion models? What are the adversarial targets that such Trojan attacks can achieve? To answer these questions, we propose an effective Trojan attack against diffusion models, TrojDiff, which optimizes the Trojan diffusion and generative processes during training. In particular, we design novel transitions during the Trojan diffusion process to diffuse adversarial targets into a biased Gaussian distribution and propose a new parameterization of the Trojan generative process that leads to an effective training objective for the attack. In addition, we consider three types of adversarial targets: the Trojaned diffusion models will always output instances belonging to a certain class from the in-domain distribution (In-D2D attack), instances from an out-of-domain distribution (Out-D2D attack), or one specific instance (D2I attack). We evaluate TrojDiff on the CIFAR-10 and CelebA datasets against both DDPM and DDIM diffusion models. We show that TrojDiff always achieves high attack performance under different adversarial targets using different types of triggers, while the performance in benign environments is preserved. The code is available at https://github.com/chenweixin107/TrojDiff.
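
The idea of diffusing a target into a biased Gaussian distribution can be illustrated with a minimal sketch. The abstract does not give the exact form of the Trojan transitions, so everything below is an assumption made for illustration: the linear noise schedule, the closed-form sampler, the bias `mu` (standing in for a trigger), the scale `gamma`, and the function names are all hypothetical and are not taken from the TrojDiff code. With `mu = 0` and `gamma = 1` the sampler reduces to the standard DDPM-style forward process; with a nonzero `mu` the final-step samples concentrate around `mu` rather than around zero.

```python
import math
import random


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def forward_sample(x0, t, alpha_bar, rng, mu=0.0, gamma=1.0):
    """Sample x_t given x_0 in closed form.

    Assumed form (illustrative, not the paper's definition):
        x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * (mu + gamma * eps),
    with eps ~ N(0, 1). For large t, x_t is approximately N(mu, gamma^2),
    i.e. a Gaussian biased by mu.
    """
    a = alpha_bar[t]
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * (mu + gamma * rng.gauss(0.0, 1.0))
        for v in x0
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bar = linear_alpha_bar()
    last = len(alpha_bar) - 1
    x0 = [1.0] * 10000  # a flat stand-in for the pixels of a target image

    benign = forward_sample(x0, last, alpha_bar, rng)
    biased = forward_sample(x0, last, alpha_bar, rng, mu=0.5, gamma=0.8)
    print("benign mean:", sum(benign) / len(benign))
    print("biased mean:", sum(biased) / len(biased))
```

Under these assumptions the benign and biased processes differ only in the terminal noise distribution, which is what would let a trigger-dependent starting point select between benign and adversarial generation. The authors' actual transitions and training objective are in the repository linked above.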

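This paper investigates the vulnerability of diffusion models to potential training data manipulation and proposes TrojDiff, an effective Trojan attack that optimizes the Trojan diffusion and generative processes. The attack diffuses adversarial targets into a biased Gaussian distribution and introduces a new parameterization of the Trojan generative process. The paper demonstrates different types of Trojan attacks against DDPM and DDIM diffusion models on the CIFAR-10 and CelebA datasets.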

TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets