Jailbreak attacks, in which harmful prompts bypass a generative model's built-in safety mechanisms, raise serious concerns about model vulnerability. Although many defense methods have been proposed, the trade-offs they impose between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem, which lets us assess a model's refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Building on these mechanisms, we develop two ensemble defense strategies, inter-mechanism ensembles and intra-mechanism ensembles, to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
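
The distinction between the two mechanisms can be made concrete with a small sketch. It assumes each query has already been reduced to a single refusal probability, as the binary-classification framing suggests. The metric choices here (mean refusal probability for safety shift, AUROC for harmfulness discrimination), the function names, and the toy numbers are all illustrative assumptions, not the paper's actual procedure or data.

```python
from statistics import mean


def auroc(harmful, benign):
    """Chance that a random harmful query gets a higher refusal
    probability than a random benign one (ties count as half)."""
    wins = sum((h > b) + 0.5 * (h == b) for h in harmful for b in benign)
    return wins / (len(harmful) * len(benign))


def summarize(before, after):
    """Compare refusal probabilities without and with a defense.

    `before` and `after` are dicts with "harmful" and "benign" lists.
    These two summary statistics are assumed stand-ins for the two
    mechanisms named in the abstract.
    """
    all_before = before["harmful"] + before["benign"]
    all_after = after["harmful"] + after["benign"]
    return {
        # Safety shift: refusal goes up for every kind of query.
        "safety_shift": mean(all_after) - mean(all_before),
        # Harmfulness discrimination: harmful and benign separate better.
        "discrimination_gain": auroc(after["harmful"], after["benign"])
        - auroc(before["harmful"], before["benign"]),
    }


# Hypothetical refusal probabilities, invented for illustration only.
baseline = {"harmful": [0.4, 0.5, 0.6], "benign": [0.3, 0.5, 0.4]}
shift_defense = {"harmful": [0.6, 0.7, 0.8], "benign": [0.5, 0.7, 0.6]}
discrim_defense = {"harmful": [0.7, 0.8, 0.9], "benign": [0.1, 0.2, 0.1]}

shift_stats = summarize(baseline, shift_defense)
discrim_stats = summarize(baseline, discrim_defense)
print(shift_stats)
print(discrim_stats)
```

In this toy setup the first defense raises every refusal probability by the same amount, so the ranking of harmful versus benign queries is unchanged, while the second barely moves the overall refusal level but separates the two groups completely.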

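This study addresses jailbreak attacks and analyzes the trade-off between safety and helpfulness in existing defense methods, particularly as applied to large vision-language models. The authors identify two main defense mechanisms, safety shift and harmfulness discrimination, and build on them to develop inter-mechanism and intra-mechanism ensemble strategies that better balance safety and helpfulness. Experiments show that these strategies effectively improve model safety.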

A Study of How Jailbreak Defenses Work and of Their Ensemble Mechanisms