Language models (LMs) are susceptible to "memorizing" training data,
including a large amount of private or copyright-protected content. To
safeguard the right to be forgotten (RTBF), machine unlearning has emerged as a
promising method for LMs to efficiently "forget" sensitive training content and
mitigate knowledge leakage risks. However, despite its good intentions, could
the unlearning mechanism be counterproductive? In this paper, we propose the
Textual Unlearning Leakage Attack (TULA), in which an adversary can infer
information about the unlearned data solely by accessing the models before and
after unlearning. Furthermore, we present variants of TULA in both black-box
and white-box scenarios. Our experiments demonstrate that machine unlearning
amplifies the risk of knowledge leakage from LMs. Specifically, TULA can
increase an adversary's ability to infer membership information about the
unlearned data by more than 20% in the black-box scenario. Moreover, with
white-box access, TULA can even reconstruct the unlearned data directly with
more than 60% accuracy. Our work is the first to reveal
that machine unlearning in LMs can, counterintuitively, create greater
knowledge leakage risks, and we hope it inspires the development of more
secure unlearning mechanisms.

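By leveraging access to the models before and after unlearning, we propose the
Textual Unlearning Leakage Attack (TULA) and show that machine unlearning in
language models amplifies the risk of knowledge leakage. This includes a
stronger ability to infer information about the unlearned data in both
black-box and white-box scenarios, as well as accurate direct reconstruction of
the unlearned data with white-box access. This work is the first to reveal that
machine unlearning in language models can, counterintuitively, create greater
knowledge risks, and it encourages the development of more secure unlearning
mechanisms.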