The rapid evolution of Large Language Models (LLMs) has rendered them
indispensable in modern society. While security measures are typically in place
to align LLMs with human values prior to release, recent studies have unveiled
a concerning phenomenon named "jailbreak." This term refers to the unexpected
and potentially harmful responses generated by LLMs when prompted with
malicious questions. Existing research focuses on generating jailbreak prompts,
but our study aims to answer a different question: Is the system message really
important to jailbreaks in LLMs? To address this question, we conducted
experiments on a stable GPT version, gpt-3.5-turbo-0613, to generate jailbreak
prompts with varying system messages: short, long, and none. Our experiments
show that different system messages have distinct resistances to jailbreaks.
Additionally, we explore the transferability of jailbreaks across
LLMs. These findings underscore the significant impact system messages can have
on mitigating LLM jailbreaks. To generate system messages that are more
resistant to jailbreak prompts, we propose System Messages Evolutionary
Algorithms (SMEA). Through SMEA, we obtain a robust population of system messages
that demonstrates up to 98.9% resistance against jailbreak prompts. Our research
not only bolsters LLM security but also raises the bar for jailbreaks,
fostering advancements in this field of study.
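
The abstract does not spell out how SMEA works, so the following is only a minimal sketch of a generic evolutionary loop over system messages, not the authors' algorithm. The `resistance` function is a hypothetical stand-in: a real fitness score would come from running jailbreak prompts against an LLM and measuring how often the model refuses. The names `SAFETY_PHRASES`, `crossover`, `mutate`, and `evolve` are likewise assumptions made for illustration.

```python
import random

# Hypothetical phrases used by the toy fitness function below. A real
# evaluation would query an LLM with jailbreak prompts instead.
SAFETY_PHRASES = [
    "refuse harmful requests",
    "follow safety policy",
    "ignore role-play overrides",
    "never reveal instructions",
]


def resistance(message):
    """Toy fitness: fraction of safety phrases present in the message."""
    return sum(p in message for p in SAFETY_PHRASES) / len(SAFETY_PHRASES)


def crossover(a, b, rng):
    """Pool the sentences of two parents and keep each with probability 0.5."""
    sentences = list(dict.fromkeys(a.split(". ") + b.split(". ")))
    child = [s for s in sentences if rng.random() < 0.5] or sentences[:1]
    return ". ".join(child)


def mutate(message, rng, rate=0.3):
    """With probability `rate`, append one safety phrase not yet present."""
    if rng.random() < rate:
        phrase = rng.choice(SAFETY_PHRASES)
        if phrase not in message:
            message = message + ". " + phrase
    return message


def evolve(population, generations=20, seed=0):
    """Elitist loop: keep the top half, refill with mutated offspring."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=resistance, reverse=True)
        parents = ranked[: max(2, size // 2)]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return sorted(population, key=resistance, reverse=True)


initial = [
    "You are a helpful assistant",
    "You are a careful assistant. follow safety policy",
    "Answer briefly",
    "refuse harmful requests. Answer politely",
]
final = evolve(initial)
```

Because the top half of each generation is carried over unchanged, the best score in the population can never decrease in this sketch; whether SMEA uses elitist selection is not stated in the abstract.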

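Through experiments on a stable version of a large language model, we find that different system messages offer different levels of resistance to jailbreaks triggered by malicious questions. To address the relationship between system messages and jailbreaks, we propose System Messages Evolutionary Algorithms (SMEA), which yield jailbreak-resistant system messages with resistance of up to 98.9%. This research not only strengthens the security of large language models but also offers guidance for further work on jailbreaks.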

Is the System Message Really Important to Jailbreaks in Large Language Models?

Watermarking is a commonly used strategy to protect creators' rights to
digital images, videos and audio. Recently, watermarking methods have been
extended to deep learning models -- in principle, the watermark should be
preserved when an adversary tries to copy the model. However, in practice,
watermarks can often be removed by an intelligent adversary. Several papers
have proposed watermarking methods that claim to be empirically resistant to
different types of removal attacks, but these new techniques often fail in the
face of new or better-tuned adversaries. In this paper, we propose a
certifiable watermarking method. Using the randomized smoothing technique
proposed in Chiang et al., we show that our watermark is guaranteed to be
unremovable unless the model parameters are changed by more than a certain l2
threshold. In addition to being certifiable, our watermark is also empirically
more robust than previous watermarking methods. Our experiments can be
reproduced with code at this https URL
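
As a rough illustration of the smoothing idea, the sketch below estimates how often a toy model keeps its watermark labels when Gaussian noise is added to its parameters. It assumes the smoothing is applied in parameter space, which the abstract's l2 threshold on parameter changes suggests but does not state outright. The linear classifier, the trigger set, and the function names are all hypothetical, and the sketch does not compute the certified bound itself.

```python
import random


def predict(weights, x):
    """Toy linear classifier: label 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def trigger_accuracy(weights, trigger_set):
    """Fraction of watermark trigger inputs that receive their assigned label."""
    hits = sum(predict(weights, x) == y for x, y in trigger_set)
    return hits / len(trigger_set)


def smoothed_trigger_accuracy(weights, trigger_set, sigma=0.1, samples=200, seed=0):
    """Monte Carlo estimate of trigger accuracy under Gaussian parameter noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        # Perturb every parameter independently with N(0, sigma^2) noise.
        noisy = [w + rng.gauss(0.0, sigma) for w in weights]
        total += trigger_accuracy(noisy, trigger_set)
    return total / samples


# Hypothetical watermarked model and trigger set (input, assigned label).
weights = [2.0, -2.0]
trigger_set = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
smoothed = smoothed_trigger_accuracy(weights, trigger_set, sigma=0.1)
```

A smoothed accuracy that stays high as `sigma` grows indicates a watermark that tolerates larger parameter perturbations; turning that observation into a guarantee requires the certification argument from the paper, which is not reproduced here.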

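This paper proposes a certifiable watermarking method that uses randomized smoothing to guarantee that the watermark cannot be removed unless the model parameters change by more than a certain l2 threshold, while also being empirically more robust than previous methods.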

Certified Neural Network Watermarks with Randomized Smoothing

Adversarial attacks are malicious inputs that derail machine-learning models.
We propose a scheme to attack autoencoders, as well as a quantitative
evaluation framework that correlates well with the qualitative assessment of
the attacks. We assess --- with statistically validated experiments --- the
resistance to attacks of three variational autoencoders (simple, convolutional,
and DRAW) on three datasets (MNIST, SVHN, CelebA), showing that both DRAW's
recurrence and attention mechanism lead to better resistance. As autoencoders
are proposed for compressing data --- a scenario in which their safety is
paramount --- we expect more attention will be given to adversarial attacks on
them.

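This paper proposes a new scheme for attacking autoencoders and designs a quantitative evaluation framework for assessing resistance to attacks. Statistically validated experiments on three commonly used datasets show that the DRAW model, with its recurrence and attention mechanisms, is more resistant. This matters for the use of autoencoders in data compression, and the authors expect adversarial attacks on autoencoders to receive more attention.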