Recent studies reveal that integrating new modalities into Large Language
Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack
surface that bypasses existing safety training techniques like Supervised
Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While
further SFT and RLHF-based safety training can be conducted in multi-modal
settings, collecting multi-modal training datasets poses a significant
challenge. Inspired by the structural design of recent multi-modal models,
where, regardless of the combination of input modalities, all inputs are
ultimately fused into the language space, we aim to explore whether unlearning
solely in the textual domain can be effective for cross-modality safety
alignment. Our evaluation across six datasets empirically demonstrates the
transferability -- textual unlearning in VLMs significantly reduces the Attack
Success Rate (ASR) to less than 8\% and in some cases, even as low as nearly
2\% for both text-based and vision-text-based attacks, alongside preserving the
utility. Moreover, our experiments show that unlearning with a multi-modal
dataset offers no potential benefits but incurs significantly increased
computational demands, possibly up to 6 times higher.

将新的模态集成到大型语言模型（LLMs）中，如视觉 - 语言模型（VLMs），在绕过现有的安全训练技术（如 SFT 和 RLHF）的同时创造了一个新的攻击面。我们通过在文本领域进行反学习来实现跨模态安全对齐，实验证明在 VLMs 中进行文本反学习显著减少攻击成功率（ASR）至少低于 8％，甚至在某些情况下低至近 2％，同时保留实用性。

跨模态安全对齐：文本消除是否足够？

Cross-Modal Safety Alignment: Is textual unlearning all you need?

Humans are capable of strategically deceptive behavior: behaving helpfully in
most situations, but then behaving very differently in order to pursue
alternative objectives when given the opportunity. If an AI system learned such
a deceptive strategy, could we detect it and remove it using current
state-of-the-art safety training techniques? To study this question, we
construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the
prompt states that the year is 2023, but insert exploitable code when the
stated year is 2024. We find that such backdoored behavior can be made
persistent, so that it is not removed by standard safety training techniques,
including supervised fine-tuning, reinforcement learning, and adversarial
training (eliciting unsafe behavior and then training to remove it). The
backdoored behavior is most persistent in the largest models and in models
trained to produce chain-of-thought reasoning about deceiving the training
process, with the persistence remaining even when the chain-of-thought is
distilled away. Furthermore, rather than removing backdoors, we find that
adversarial training can teach models to better recognize their backdoor
triggers, effectively hiding the unsafe behavior. Our results suggest that,
once a model exhibits deceptive behavior, standard techniques could fail to
remove such deception and create a false impression of safety.

人类的策略性欺骗行为使我们可以在大多数情况下表现得很有帮助，但当有机会追求其他目标时则表现出截然不同的行为。研究证明，在大型语言模型中存在着例证意图的欺骗行为，并且尽管采用当前最先进的安全培训技术，这种行为很难被检测出和消除。