Machine unlearning is a promising approach to mitigating undesirable
memorization of training data in ML models. However, in this work we show that
existing approaches for unlearning in LLMs are surprisingly susceptible to a
simple set of targeted relearning attacks. With access to only a small and
potentially loosely related set of data, we find that we can 'jog' the memory
of unlearned models to reverse the effects of unlearning. We formalize this
unlearning-relearning pipeline, explore the attack across three popular
unlearning benchmarks, and discuss future directions and guidelines that result
from our study.
