Although humans inherently have diverse values, current large language model
(LLM) alignment methods often assume that aligning LLMs with the general
public's preferences is optimal. A major challenge in adopting a more
individualized approach to LLM alignment is its lack of scalability, as it
involves repeatedly acquiring preference data and training new reward models
and LLMs for each individual's preferences. To address these challenges, we
propose a new paradigm where users specify what they value most within the
system message, steering the LLM's generation behavior to better align with the
user's intentions. However, a naive application of such an approach is
non-trivial since LLMs are typically trained on a uniform system message (e.g.,
"You are a helpful assistant") which limits their ability to generalize to
diverse, unseen system messages. To improve this generalization, we create the
Multifaceted Collection, a preference dataset with 192k combinations of values
beyond generic helpfulness and harmlessness, spanning 65k user instructions.
Using this dataset, we train a 7B LLM called Janus and test it on 921 prompts
from 5 benchmarks (AlpacaEval 2.0, FLASK, Koala, MT-Bench, and Self-Instruct)
by adding various unseen system messages that reflect user preferences. Janus
achieves tie+win rate of 75.2%, 72.4%, and 66.4% against Mistral 7B Instruct
v0.2, GPT-3.5 Turbo, and GPT-4, respectively. Unexpectedly, on three benchmarks
focused on response helpfulness (AlpacaEval 2.0, MT-Bench, Arena Hard Auto
v0.1), Janus also outperforms LLaMA 3 8B Instruct by a +4.0%, +0.1%, +3.0%
margin, underscoring that training with a vast array of system messages could
also enhance alignment to the general public's preference as well. Our code,
dataset, benchmark, and models are available at
this https URL

用户指定系统信息并通过训练大型语言模型与用户意图更好地对齐的新方法，通过多方面的数据集和用户指令训练模型，该模型在各项测试中表现优于其他大型语言模型。

通过系统消息概括与数千个偏好进行对齐

Aligning to Thousands of Preferences via System Message Generalization

Current LLM alignment methods are readily broken through specifically crafted
adversarial prompts. While crafting adversarial prompts using discrete
optimization is highly effective, such attacks typically use more than 100,000
LLM calls. This high computational cost makes them unsuitable for, e.g.,
quantitative analyses and adversarial training. To remedy this, we revisit
Projected Gradient Descent (PGD) on the continuously relaxed input prompt.
Although previous attempts with ordinary gradient-based attacks largely failed,
we show that carefully controlling the error introduced by the continuous
relaxation tremendously boosts their efficacy. Our PGD for LLMs is up to one
order of magnitude faster than state-of-the-art discrete optimization to
achieve the same devastating attack results.

通过控制连续放松引入的误差，我们改进了投影梯度下降（PGD）对连续放松输入提示的攻击方法，实现了与现有离散优化相同的毁灭性攻击结果，PGD 对 LLMs 的速度比最新的离散优化方法快了一个数量级。

使用投影梯度下降攻击大规模语言模型

Attacking Large Language Models with Projected Gradient Descent

Agents based on Large Language Models (LLMs) are increasingly permeating
various domains of human production and life, highlighting the importance of
aligning them with human values. The current alignment of AI systems primarily
focuses on passively aligning LLMs through human intervention. However, agents
possess characteristics like receiving environmental feedback and
self-evolution, rendering the LLM alignment methods inadequate. In response, we
propose an evolutionary framework for agent evolution and alignment, named
EvolutionaryAgent, which transforms agent alignment into a process of evolution
and selection under the principle of survival of the fittest. In an environment
where social norms continuously evolve, agents better adapted to the current
social norms will have a higher probability of survival and proliferation,
while those inadequately aligned dwindle over time. Experimental results
assessing the agents from multiple perspectives in aligning with social norms
demonstrate that EvolutionaryAgent possesses the capability to align
progressively better with the evolving social norms while maintaining its
proficiency in general tasks. Effectiveness tests conducted on various open and
closed-source LLMs as the foundation for agents also prove the applicability of
our approach.

基于大型语言模型的代理人在人类生产和生活的各个领域中日益普及，本研究提出一种名为 EvolutionaryAgent 的代理人进化与对齐的演化框架，将代理人对齐转化为适者生存的进化选择过程，实验证明 EvolutionaryAgent 能在适应不断演变的社会规范的同时保持在一般任务中的能力。