Machine learning models may inadvertently memorize sensitive, unauthorized,
or malicious data, posing risks of privacy violations, security breaches, and
performance deterioration. To address these issues, machine unlearning has
emerged as a critical technique for selectively removing the influence of
specific training data points from trained models. This paper provides a
comprehensive taxonomy and analysis of machine unlearning research. We
categorize existing research into exact unlearning, which algorithmically
removes data influence entirely, and approximate unlearning, which efficiently
minimizes influence through limited parameter updates. By reviewing the
state-of-the-art solutions,
we critically discuss their advantages and limitations. Furthermore, we propose
future directions to advance machine unlearning and establish it as an
essential capability for trustworthy and adaptive machine learning. This paper
provides researchers with a roadmap of open problems, encouraging impactful
contributions to address real-world needs for selective data removal.
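
To make the distinction between the two categories concrete, the following is a minimal sketch and not a method taken from the surveyed work. It assumes a hypothetical one-dimensional ridge regression through the origin, where exact unlearning is realized as retraining on the retained data and approximate unlearning as a small, fixed number of gradient steps from the original weight. The names `fit`, `exact_unlearn`, `approximate_unlearn`, and all parameter values are illustrative assumptions.

```python
def fit(data, lam=0.1):
    """Closed-form minimizer of sum((w*x - y)**2) + lam * w**2."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def exact_unlearn(data, forget, lam=0.1):
    """Exact unlearning, realized here as retraining on the retained data."""
    retained = [p for p in data if p not in forget]
    return fit(retained, lam)


def approximate_unlearn(w, data, forget, lam=0.1, lr=0.01, steps=3):
    """Approximate unlearning: a few gradient steps on the retained data,
    starting from the weight that was trained on the full data."""
    retained = [p for p in data if p not in forget]
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in retained) + 2 * lam * w
        w -= lr * grad
    return w


# Toy data: the last point is an outlier that we later ask the model to forget.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 20.0)]
forget = [(4.0, 20.0)]

w_full = fit(data)                                    # trained on everything
w_exact = exact_unlearn(data, forget)                 # influence fully removed
w_approx = approximate_unlearn(w_full, data, forget)  # influence reduced
```

In this toy setting the approximate result moves from `w_full` toward `w_exact` without reaching it, which reflects the trade-off between update cost and completeness of removal that the taxonomy describes.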

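Machine unlearning is a key technique for selectively removing the influence of training data points from trained models. This paper provides a comprehensive taxonomy and analysis of machine unlearning research, reviews the state-of-the-art solutions, discusses their advantages and limitations, and proposes future directions to advance machine unlearning as an essential capability for trustworthy and adaptive machine learning.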

Machine Unlearning: Solutions and Challenges

Recent explorations with commercial Large Language Models (LLMs) have shown
that non-expert users can jailbreak LLMs simply by manipulating their prompts,
resulting in degenerate output behavior, privacy and security breaches,
offensive outputs, and violations of content regulator policies. Few formal
studies have been carried out to formalize and analyze these attacks and their
mitigations. We bridge this gap by proposing a formalism and a taxonomy of
known (and possible) jailbreaks. We perform a survey of existing jailbreak
methods and their effectiveness on open-source and commercial LLMs (such as GPT
3.5, OPT, BLOOM, and FLAN-T5-xxl). We further propose a limited set of prompt
guards and discuss their effectiveness against known attack types.

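This work proposes a formalism and a taxonomy of known (and possible) jailbreak attacks, and surveys existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT 3.5, OPT, BLOOM, and FLAN-T5-xxl). We further propose a limited set of prompt guards and discuss their effectiveness against known attack types.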

Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

Given a state-of-the-art deep neural network classifier, we show the
existence of a universal (image-agnostic) and very small perturbation vector
that causes natural images to be misclassified with high probability. We
propose a systematic algorithm for computing universal perturbations, and show
that state-of-the-art deep neural networks are highly vulnerable to such
perturbations, even though these perturbations are quasi-imperceptible to the
human eye. We further
empirically analyze these universal perturbations and show, in particular, that
they generalize very well across neural networks. The surprising existence of
universal perturbations reveals important geometric correlations among
different parts of the high-dimensional decision boundary of classifiers. It
further points to potential security breaches: there exist single directions
in the input space that adversaries can possibly exploit to break a classifier
on most natural images.
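
The idea of accumulating one perturbation over many inputs can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes a hypothetical 3-class linear classifier in two dimensions (weights `W`), uses the closed-form minimal step across the nearest linear decision boundary as the per-point update, and projects onto an L2 ball of radius `xi`. The data points, `xi`, `delta`, and `overshoot` are illustrative values chosen so that a single direction fools every point.

```python
import math

# Hypothetical 3-class linear classifier in 2-D: score_k(x) = W[k] . x
W = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]


def scores(x):
    return [w[0] * x[0] + w[1] * x[1] for w in W]


def predict(x):
    s = scores(x)
    return s.index(max(s))


def minimal_step(x, overshoot=0.02):
    """Smallest L2 step (plus a small overshoot) that carries x across its
    nearest decision boundary; closed form because the classifier is linear."""
    s = scores(x)
    k0 = s.index(max(s))
    best = None
    for k in range(len(W)):
        if k == k0:
            continue
        dw = (W[k][0] - W[k0][0], W[k][1] - W[k0][1])
        gap = abs(s[k] - s[k0])
        norm = math.hypot(dw[0], dw[1])
        if best is None or gap / norm < best[0]:
            best = (gap / norm, gap, norm, dw)
    _, gap, norm, dw = best
    scale = (1.0 + overshoot) * gap / (norm * norm)
    return (scale * dw[0], scale * dw[1])


def project(v, xi):
    """Project v onto the L2 ball of radius xi."""
    n = math.hypot(v[0], v[1])
    return v if n <= xi else (v[0] * xi / n, v[1] * xi / n)


def fooling_rate(points, v):
    """Fraction of points whose predicted label changes when v is added."""
    changed = sum(predict((x[0] + v[0], x[1] + v[1])) != predict(x) for x in points)
    return changed / len(points)


def universal_perturbation(points, xi=1.0, delta=0.2, max_passes=10):
    """Accumulate per-point minimal steps into one shared vector v, stopping
    once the fooling rate reaches 1 - delta or max_passes is exhausted."""
    v = (0.0, 0.0)
    rate = 0.0
    for _ in range(max_passes):
        for x in points:
            xv = (x[0] + v[0], x[1] + v[1])
            if predict(xv) == predict(x):  # v does not fool this point yet
                r = minimal_step(xv)
                v = project((v[0] + r[0], v[1] + r[1]), xi)
        rate = fooling_rate(points, v)
        if rate >= 1.0 - delta:
            break
    return v, rate


# Toy "natural" inputs drawn from classes 0 and 1 only.
POINTS = [(1.0, 0.5), (-1.0, 0.5), (2.0, 1.2), (-2.0, 1.2)]
v, rate = universal_perturbation(POINTS)
```

Because the toy classifier is linear and low-dimensional, the shared direction simply pushes every point into the third class. The sketch shows the accumulate-and-project loop only and says nothing about the geometry of deep networks in high dimensions.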

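This paper studies deep neural network classifiers and shows that there exists a universal, very small perturbation that causes natural images to be misclassified with high probability. It proposes a systematic algorithm for computing universal perturbations and shows that state-of-the-art neural networks are highly vulnerable to them, even though the perturbations are almost imperceptible to the human eye. An empirical analysis further shows that these perturbations generalize well across neural networks, revealing important geometric correlations in the high-dimensional decision boundaries of classifiers. It also points to a potential security risk: adversaries could exploit these single directions in the input space to break a classifier on most natural images.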