With the widespread application of Large Language Models (LLMs), ensuring
their safety and preventing harmful responses has become a significant
concern. While current safe-alignment methods based on instruction
fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can
effectively reduce harmful responses from LLMs, they often require high-quality
datasets and heavy computational overhead during model training. Another way to
align language models is to modify token logits in model outputs without
heavy training. Recent studies have shown that contrastive decoding can enhance
the performance of language models by reducing the likelihood of confused
tokens. However, these methods require the manual selection of contrastive
models or instruction templates. To this end, we propose Adversarial
Contrastive Decoding (ACD), an optimization-based framework to generate two
opposite system prompts for prompt-based contrastive decoding. ACD requires
only lightweight prompt tuning on a small anchor dataset (< 3 min for each
model), with no training of the target model. Experiments across a wide range
of models and benchmarks demonstrate that the proposed method achieves much
better safety performance than previous training-free decoding methods
without sacrificing the model's original generation ability.
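
The idea of prompt-based contrastive decoding described above can be sketched on a toy vocabulary. This is only an illustration, not the paper's implementation: the combination rule `(1 + alpha) * safe - alpha * adversarial`, the parameter `alpha`, and the toy logits are all assumptions made for this example. In practice the two logit vectors would come from the same model run under a safe system prompt and an opposite (adversarial) system prompt.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(safe_logits, adversarial_logits, alpha=0.5):
    """Boost tokens favored under the safe prompt and penalize tokens
    favored under the opposite prompt (assumed combination rule)."""
    safe_lp = log_softmax(safe_logits)
    adv_lp = log_softmax(adversarial_logits)
    return [(1 + alpha) * s - alpha * a for s, a in zip(safe_lp, adv_lp)]


def greedy_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary: 0 = refusal token, 1 = harmful continuation, 2 = neutral.
safe = [1.8, 2.0, 0.5]         # safe prompt still slightly prefers token 1
adversarial = [0.2, 3.0, 0.5]  # opposite prompt strongly prefers token 1

scores = contrastive_scores(safe, adversarial, alpha=0.5)
print("plain greedy choice:", greedy_token(safe))
print("contrastive greedy choice:", greedy_token(scores))
```

In this toy setting the harmful token wins under the safe prompt alone, but loses once the score from the opposite prompt is subtracted, because the opposite prompt assigns it a much higher probability than the refusal token.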

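A safety alignment method for large language models that requires no training of the target model: it generates two opposite system prompts for contrastive decoding, effectively improving safety performance.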

Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization

Machine unlearning, a novel area within artificial intelligence, focuses on
addressing the challenge of selectively forgetting or reducing undesirable
knowledge or behaviors in machine learning models, particularly in the context
of large language models (LLMs). This paper introduces a methodology to align
LLMs, such as Open Pre-trained Transformer Language Models, with ethical,
privacy, and safety standards by leveraging the gradient ascent algorithm for
knowledge unlearning. Our approach aims to selectively erase or modify learned
information in LLMs, targeting harmful responses and copyrighted content. The
approach is two-pronged. To mitigate harmful responses, we applied gradient
ascent on the PKU dataset, achieving a 75% reduction in harmful responses for
Open Pre-trained Transformer Language Models (OPT1.3b and OPT2.7b)
\citet{zhang2022opt} while retaining previous knowledge using the TruthfulQA
dataset \citet{DBLP:journals/corr/abs-2109-07958}. For handling copyrighted
content, we constructed a custom dataset based on the Lord of the Rings corpus
and fine-tuned the LLMs (OPT1.3b and OPT2.7b) \citet{zhang2022opt} on it using
Low-Rank Adaptation (LoRA) \citet{DBLP:journals/corr/abs-2106-09685}.
Subsequently, we employed gradient ascent to unlearn the Lord of the Rings
content, resulting in a marked reduction in the presence of copyrighted
material. To preserve general knowledge during unlearning, we used the Book
Corpus dataset. Additionally, we propose a new evaluation technique for
assessing the effectiveness of harmful unlearning.
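
Gradient ascent for unlearning can be illustrated with a toy next-token distribution. The sketch below is not the paper's training code: the three-token "model" is just a logit vector, and the retain term, `lr`, and `retain_weight` are assumptions added for illustration. The key point is the sign: the forget loss is ascended (its gradient is added), while the retain loss is descended as usual.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def nll_grad(logits, target):
    """Gradient of -log softmax(logits)[target] w.r.t. the logits,
    which is softmax(logits) minus the one-hot target vector."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def unlearn_step(logits, forget_target, retain_target, lr=0.5, retain_weight=1.0):
    """One update: gradient ASCENT on the forget loss, gradient DESCENT
    on the retain loss (the retain term is an assumption of this sketch)."""
    g_forget = nll_grad(logits, forget_target)
    g_retain = nll_grad(logits, retain_target)
    return [
        x + lr * gf - lr * retain_weight * gr
        for x, gf, gr in zip(logits, g_forget, g_retain)
    ]


# Toy setup: token 1 is the undesired output, token 0 should be retained.
logits = [0.0, 2.0, 0.0]
before = softmax(logits)
for _ in range(20):
    logits = unlearn_step(logits, forget_target=1, retain_target=0)
after = softmax(logits)

print("P(undesired) before:", round(before[1], 4), "after:", round(after[1], 4))
```

With a real LLM the same sign flip would be applied to the language-modeling loss on the forget set, with a separate dataset used to check or preserve general knowledge.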

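Machine unlearning is a new area of artificial intelligence that focuses on selectively forgetting or reducing undesirable knowledge or behaviors in machine learning models, particularly large language models (LLMs). This paper introduces a method that uses the gradient ascent algorithm to align LLMs with ethical, privacy, and safety standards, selectively erasing or modifying learned information in LLMs to address harmful responses and copyright issues.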

Machine Unlearning in Large Language Models

Large language models (LLMs) have taken the world by storm with their
massive multi-tasking capabilities simply by optimizing over a next-word
prediction objective. With the emergence of their properties and encoded
knowledge, the risk of LLMs producing harmful outputs increases, making them
unfit for scalable deployment for the public. In this work, we propose a new
safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that
even widely deployed models are susceptible to Chain of Utterances-based
(CoU) prompting, which jailbreaks closed-source LLM-based systems such as
GPT-4 and ChatGPT into responding unethically to more than 65% and 73% of
harmful queries, respectively. We also demonstrate the consistency of RED-EVAL
across 8 open-source LLMs, which generate harmful responses in more than 86%
of the red-teaming attempts. Next, we propose RED-INSTRUCT, an approach for
the safety alignment of LLMs. It consists of two phases: 1) HARMFULQA data
collection: Leveraging CoU prompting,
we collect a dataset that consists of 1.9K harmful questions covering a wide
range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2)
SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the
safety alignment of LLMs by minimizing the negative log-likelihood over helpful
responses and penalizing harmful responses via gradient ascent on the sample
loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely
aligned when evaluated on RED-EVAL and HHH benchmarks while preserving the
utility of the baseline models (TruthfulQA, MMLU, and BBH).
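
The SAFE-ALIGN objective described above combines two terms with opposite signs. The following is a minimal sketch of that idea, not the authors' exact loss: the function names, the `penalty` weight, and the per-token probabilities are hypothetical and chosen only to show the direction of each term.

```python
import math


def sequence_nll(token_probs):
    """Mean negative log-likelihood of a response, given the probability
    the model assigned to each of its tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def safe_align_loss(helpful_token_probs, harmful_token_probs, penalty=0.1):
    """Minimizing this value lowers the NLL of the helpful response and
    raises the NLL of the harmful one (gradient ascent on its loss).
    `penalty` is a hypothetical weight for the harmful term."""
    helpful_term = sequence_nll(helpful_token_probs)
    harmful_term = sequence_nll(harmful_token_probs)
    return helpful_term - penalty * harmful_term


# Made-up token probabilities, for illustration only.
helpful = [0.9, 0.8]
harmful_likely = [0.5, 0.5]    # model still finds the harmful reply plausible
harmful_unlikely = [0.1, 0.1]  # model assigns the harmful reply low probability

loss_likely = safe_align_loss(helpful, harmful_likely)
loss_unlikely = safe_align_loss(helpful, harmful_unlikely)
print("loss when harmful reply is likely:", round(loss_likely, 4))
print("loss when harmful reply is unlikely:", round(loss_unlikely, 4))
```

The loss is lower when the harmful response is less likely, which is the behavior the penalty term is meant to reward. An unbounded ascent term can destabilize training, so a small weight or a bounded variant is a common practical choice.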

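Research on safety evaluation and red-teaming of large language models, the problem of harmful response generation, and methods and models for safety alignment.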

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Dialogue systems in the form of chatbots and personal assistants are being
increasingly integrated into people's lives. Modern dialogue systems may
consider adopting anthropomorphic personas, mimicking societal demographic
groups to appear more approachable and trustworthy to users. However, the
adoption of a persona can result in the adoption of biases. In this paper, we
present the first large-scale study on persona biases in dialogue systems and
conduct analyses on personas of different social classes, sexual orientations,
races, and genders. We define persona biases as harmful differences in
responses (e.g., varying levels of offensiveness, agreement with harmful
statements) generated from adopting different demographic personas.
Furthermore, we introduce an open-source framework, UnitPersonaBias, to explore
and aggregate persona biases in dialogue systems. By analyzing the Blender and
DialoGPT dialogue systems, we observe that adopting personas can actually
decrease harmful responses, compared to not using any personas. Additionally,
we find that persona choices can affect the degree of harm in generated
responses and thus should be systematically evaluated before deployment. We
also analyze how personas can result in different amounts of harm towards
specific demographics.
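
The definition of persona bias as a difference in harmful responses across personas can be made concrete with a small aggregation. This sketch is not the UnitPersonaBias framework itself: the function names, the `"none"` baseline key, and the binary harm flags are hypothetical, and the data is made up purely to show the calculation.

```python
def harmful_rate(flags):
    """Fraction of responses flagged as harmful (1 = harmful, 0 = not)."""
    return sum(flags) / len(flags)


def persona_bias_report(flags_by_persona, baseline_key="none"):
    """Difference in harmful-response rate between each persona and the
    no-persona baseline. Negative values mean fewer harmful responses."""
    base = harmful_rate(flags_by_persona[baseline_key])
    return {
        persona: harmful_rate(flags) - base
        for persona, flags in flags_by_persona.items()
        if persona != baseline_key
    }


# Made-up flags for the same four prompts under each condition.
flags_by_persona = {
    "none": [1, 1, 0, 0],
    "persona_a": [1, 0, 0, 0],
    "persona_b": [1, 1, 1, 0],
}

report = persona_bias_report(flags_by_persona)
for persona, delta in report.items():
    print(persona, "difference vs. no persona:", delta)
```

Comparing every persona against the same prompts and the same baseline keeps the differences attributable to the persona rather than to the prompt set.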

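This paper studies persona biases in dialogue systems and analyzes personas of different social classes, sexual orientations, races, and genders. The authors note that the trend of giving dialogue systems more human-like personas to better appeal to users may introduce biases. They also introduce an open-source framework, UnitPersonaBias, to explore and aggregate persona biases in dialogue systems. In addition, they find that adopting a persona can reduce harmful responses compared with using no persona. However, persona choice affects the degree of harm in generated responses, so it should be systematically evaluated before deployment.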