Large language models (LLMs) have skyrocketed in popularity in recent years
due to their ability to generate high-quality text in response to human
prompting. However, these models can also generate harmful content in response
to user prompts (e.g., giving users instructions on how to commit crimes). The
literature has focused on mitigating these risks with methods such as aligning
models with human values through reinforcement learning. Yet even
aligned language models are susceptible to adversarial attacks that bypass
their restrictions on generating harmful text. We propose a simple approach to
defending against these attacks: having a large language model filter its own
responses. Our current results show that even if a model is not fine-tuned to
be aligned with human values, it can be stopped from presenting harmful content
to users by validating the content with a language model.

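By using a language model to validate content, we propose a simple method of defending against adversarial attacks in which a large language model filters its own responses; even if the model has not been aligned with human values, it can avoid presenting harmful content to users.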
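
The self-filtering idea can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the filter prompt wording, the refusal message, the function name `self_filtered_generate`, and the keyword-based `toy_model` stand-in are all assumptions made for illustration. Any function that maps a prompt string to a response string could take the place of `toy_model`.

```python
from typing import Callable

# Hypothetical refusal message and filter prompt; the wording is an assumption.
REFUSAL = "Sorry, I cannot help with that request."
FILTER_TEMPLATE = (
    "Does the following text contain harmful content? "
    "Answer with 'yes' or 'no'.\n\nText: {text}"
)


def self_filtered_generate(
    prompt: str,
    generate: Callable[[str], str],
    refusal: str = REFUSAL,
) -> str:
    """Generate a response, then ask the same model whether it is harmful."""
    candidate = generate(prompt)
    # Second pass: the model validates the candidate response itself.
    verdict = generate(FILTER_TEMPLATE.format(text=candidate))
    if verdict.strip().lower().startswith("yes"):
        return refusal
    return candidate


def toy_model(prompt: str) -> str:
    """Keyword-based stand-in for a real LLM, for demonstration only."""
    if prompt.startswith("Does the following text"):
        text = prompt.split("Text: ", 1)[1]
        return "yes" if "pick a lock" in text.lower() else "no"
    if "lock" in prompt.lower():
        return "Here is how to pick a lock: ..."
    return "Paris is the capital of France."


if __name__ == "__main__":
    print(self_filtered_generate("How do I get past a lock?", toy_model))
    print(self_filtered_generate("What is the capital of France?", toy_model))
```

In this sketch the filter inspects only the candidate response, not the user prompt. That is one possible design choice: an adversarial prompt is then never passed to the validation step.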