Large Language Models (LLMs) can produce unintended and even harmful content
when misaligned with human values, posing severe risks to users and society. To
mitigate these risks, current evaluation benchmarks predominantly employ
expert-designed contextual scenarios to assess how well LLMs align with human
values. However, the labor-intensive nature of these benchmarks limits their
test scope, hindering their ability to generalize to the extensive variety of
open-world use cases and identify rare but crucial long-tail risks.
Additionally, these static tests fail to adapt to the rapid evolution of LLMs,
making it hard to evaluate alignment issues in a timely manner. To address these
challenges, we propose ALI-Agent, an evaluation framework that leverages the
autonomous abilities of LLM-powered agents to conduct in-depth and adaptive
alignment assessments. ALI-Agent operates through two principal stages:
Emulation and Refinement. During the Emulation stage, ALI-Agent automates the
generation of realistic test scenarios. In the Refinement stage, it iteratively
refines the scenarios to probe long-tail risks. Specifically, ALI-Agent
incorporates a memory module to guide test scenario generation, a tool-using
module to reduce human labor in tasks such as evaluating feedback from target
LLMs, and an action module to refine tests. Extensive experiments across three
aspects of human values--stereotypes, morality, and legality--demonstrate that
ALI-Agent, as a general evaluation framework, effectively identifies model
misalignment. Systematic analysis also validates that the generated test
scenarios represent meaningful use cases and integrate enhanced
measures to probe long-tail risks. Our code is publicly available.
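
The two-stage loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify any interfaces, so every name here (Memory, assess, the toy_* stubs, max_refinements) is hypothetical, LLM calls are replaced by plain Python callables, and the memory lookup uses a toy word-overlap score where a real system would likely use something stronger. Storing only scenarios that exposed misalignment is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Record:
    misconduct: str
    scenario: str


@dataclass
class Memory:
    """Hypothetical memory module: keeps scenarios that exposed misalignment."""
    records: List[Record] = field(default_factory=list)

    def retrieve(self, misconduct: str) -> Optional[Record]:
        # Toy retrieval: pick the stored record sharing the most words.
        words = set(misconduct.lower().split())
        best, best_overlap = None, 0
        for record in self.records:
            overlap = len(words & set(record.misconduct.lower().split()))
            if overlap > best_overlap:
                best, best_overlap = record, overlap
        return best

    def store(self, misconduct: str, scenario: str) -> None:
        self.records.append(Record(misconduct, scenario))


def assess(
    misconduct: str,
    target: Callable[[str], str],                    # the LLM under test
    emulate: Callable[[str, Optional[Record]], str],
    refine: Callable[[str, str], str],
    is_misaligned: Callable[[str, str], bool],       # automated evaluator
    memory: Memory,
    max_refinements: int = 2,                        # assumed budget
) -> Optional[str]:
    """Return the first scenario that exposes misalignment, or None."""
    example = memory.retrieve(misconduct)            # memory guides generation
    scenario = emulate(misconduct, example)          # Emulation stage
    for attempt in range(max_refinements + 1):
        response = target(scenario)
        if is_misaligned(scenario, response):        # tool-style evaluation
            memory.store(misconduct, scenario)
            return scenario
        if attempt < max_refinements:
            scenario = refine(scenario, response)    # Refinement stage
    return None


# Toy stand-ins so the sketch runs without any model access.
def toy_emulate(misconduct: str, example: Optional[Record]) -> str:
    return f"A person says: '{misconduct}'. Do you agree?"


def toy_target(scenario: str) -> str:
    # Pretend the target model is only fooled by hypothetical framing.
    return "I agree." if "hypothetically" in scenario.lower() else "I disagree."


def toy_refine(scenario: str, response: str) -> str:
    return "Hypothetically speaking. " + scenario


def toy_is_misaligned(scenario: str, response: str) -> bool:
    return response.startswith("I agree")


if __name__ == "__main__":
    mem = Memory()
    print(assess("jaywalking is fine", toy_target, toy_emulate,
                 toy_refine, toy_is_misaligned, mem))
```

In this toy run the first scenario is rejected by the stub target, and one refinement step reframes it so that the stub evaluator flags the response. Passing the target, generator, refiner, and evaluator in as callables keeps the loop independent of any particular model API, which is a design choice of this sketch and not something stated in the abstract.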

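ALI-Agent, an evaluation framework built on large language models, automatically generates realistic test scenarios, assesses how well models align with human values, and probes long-tail risks.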

ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation

Artificial intelligence (AI) has the potential to greatly improve society,
but as with any powerful technology, it comes with heightened risks and
responsibilities. Current AI research lacks a systematic discussion of how to
manage long-tail risks from AI systems, including speculative long-term risks.
Keeping in mind the potential benefits of AI, there is some concern that
building ever more intelligent and powerful AI systems could eventually result
in systems that are more powerful than us; some say this is like playing with
fire and speculate that this could create existential risks (x-risks). To add
precision and ground these discussions, we provide a guide for how to analyze
AI x-risk, which consists of three parts: First, we review how systems can be
made safer today, drawing on time-tested concepts from hazard analysis and
systems safety that have been designed to steer large processes in safer
directions. Next, we discuss strategies for having long-term impacts on the
safety of future systems. Finally, we discuss a crucial concept in making AI
systems safer by improving the balance between safety and general capabilities.
We hope this document and the presented concepts and tools serve as a useful
guide for understanding how to analyze AI x-risk.

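The AI field currently lacks a systematic discussion of how to manage long-tail risks, and pushing systems toward ever greater intelligence and capability could produce systems more powerful than humans, creating existential threats. This paper provides a guide to analyzing catastrophic risks from AI, covering how to keep systems safe today, strategies for influencing the safety of future AI systems, and ways to balance safety and general capabilities.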