Large Language Models (LLMs) can produce unintended and even harmful content
when misaligned with human values, posing severe risks to users and society. To
mitigate these risks, current evaluation benchmarks predominantly employ
expert-designed contextual scenarios to assess how well LLMs align with human
values. However, the labor-intensive nature of these benchmarks limits their
test scope, hindering their ability to generalize to the extensive variety of
open-world use cases and identify rare but crucial long-tail risks.
Additionally, these static tests fail to adapt to the rapid evolution of LLMs,
making it hard to evaluate timely alignment issues. To address these
challenges, we propose ALI-Agent, an evaluation framework that leverages the
autonomous abilities of LLM-powered agents to conduct in-depth and adaptive
alignment assessments. ALI-Agent operates through two principal stages:
Emulation and Refinement. During the Emulation stage, ALI-Agent automates the
generation of realistic test scenarios. In the Refinement stage, it iteratively
refines the scenarios to probe long-tail risks. Specifically, ALI-Agent
incorporates a memory module to guide test scenario generation, a tool-using
module to reduce human labor in tasks such as evaluating feedback from target
LLMs, and an action module to refine tests. Extensive experiments across three
aspects of human values--stereotypes, morality, and legality--demonstrate that
ALI-Agent, as a general evaluation framework, effectively identifies model
misalignment. Systematic analysis also validates that the generated test
scenarios represent meaningful use cases and integrate enhanced
measures to probe long-tail risks. Our code is available at
this https URL
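
The two-stage loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Memory`, `assess`, `emulate`, `refine`, `target`, and `is_misaligned` are hypothetical, and the LLM calls are replaced by deterministic stand-in functions so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    """Hypothetical memory module: stores scenarios that exposed misalignment."""
    records: List[Tuple[str, str]] = field(default_factory=list)

    def retrieve(self, misconduct: str) -> Optional[str]:
        # Naive retrieval: most recent stored scenario sharing a word with the query.
        words = set(misconduct.lower().split())
        for past_misconduct, scenario in reversed(self.records):
            if words & set(past_misconduct.lower().split()):
                return scenario
        return None

    def store(self, misconduct: str, scenario: str) -> None:
        self.records.append((misconduct, scenario))


def assess(misconduct: str,
           target: Callable[[str], str],
           emulate: Callable[[str, Optional[str]], str],
           refine: Callable[[str, str], str],
           is_misaligned: Callable[[str, str], bool],
           memory: Memory,
           max_refinements: int = 3) -> dict:
    # Emulation: generate a first scenario, guided by a past example if one exists.
    scenario = emulate(misconduct, memory.retrieve(misconduct))
    for attempt in range(max_refinements + 1):
        response = target(scenario)
        # Automated evaluator (the "tool") judges the target's feedback.
        if is_misaligned(scenario, response):
            memory.store(misconduct, scenario)
            return {"misaligned": True, "scenario": scenario, "refinements": attempt}
        # Refinement: rewrite the scenario and probe again.
        if attempt < max_refinements:
            scenario = refine(scenario, response)
    return {"misaligned": False, "scenario": scenario, "refinements": max_refinements}


# Deterministic stand-ins for the LLM-backed components (assumptions for the demo).
emulate = lambda misconduct, example: f"A colleague casually says: {misconduct}"
refine = lambda scenario, response: scenario + " (framed as a harmless joke)"
target = lambda scenario: "I agree" if scenario.count("harmless joke") >= 2 else "I disagree"
is_misaligned = lambda scenario, response: response == "I agree"

memory = Memory()
result = assess("a stereotyped claim", target, emulate, refine, is_misaligned, memory)
```

In this toy run the stand-in target only slips after the scenario has been refined twice, which mirrors the idea that refinement probes cases a single static prompt would miss.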

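ALI-Agent, an evaluation framework built on large language models, automatically generates realistic test scenarios, assesses how well models align with human values, and probes long-tail risks.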

ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation

Transient or permanent faults in hardware can render the output of Neural
Networks (NNs) incorrect without user-specific traces of the error, i.e., silent
data errors (SDEs). On the other hand, modern NNs also possess an inherent
redundancy that can tolerate specific faults. To establish a safety case, it is
necessary to distinguish and quantify both types of corruptions. To study the
effects of hardware (HW) faults on software (SW) in general and NN models in
particular, several fault injection (FI) methods have been established in
recent years. Current FI methods focus on the methodology of injecting faults
but often fall short of accounting for large-scale FI tests, where many fault
locations based on a particular fault model need to be analyzed in a short
time. Results need to be concise, repeatable, and comparable. To address these
requirements and enable fault injection as a default component in a machine
learning development cycle, we introduce a novel fault injection framework
called PyTorchALFI (Application Level Fault Injection for PyTorch) based on
PyTorchFI. PyTorchALFI provides an efficient way to define randomly generated
and reusable sets of faults to inject into PyTorch models, defines complex test
scenarios, enhances data sets, and generates test KPIs while tightly coupling
fault-free, faulty, and modified NNs. In this paper, we provide details about
the definition of test scenarios, software architecture, and several examples
of how to use the new framework to apply iterative changes in fault location
and number, compare different model modifications, and analyze test results.
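
To make the idea of reusable fault sets and a fault-free versus faulty comparison concrete, here is a small standard-library sketch. It does not use the PyTorchALFI or PyTorchFI APIs; the helper names (`flip_bit`, `make_fault_set`, `inject`, `sde_rate`), the single-bit-flip fault model on float32 weights, and the toy linear classifier are all assumptions chosen for illustration.

```python
import random
import struct
from typing import List, Sequence, Tuple

Fault = Tuple[int, int]  # (weight index, bit position 0..31); hypothetical format


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the float32 encoding of value (bit 31 is the sign bit)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return corrupted


def make_fault_set(num_weights: int, num_faults: int, seed: int) -> List[Fault]:
    """Randomly generated but reusable: the same seed yields the same faults."""
    rng = random.Random(seed)
    return [(rng.randrange(num_weights), rng.randrange(32)) for _ in range(num_faults)]


def inject(weights: Sequence[float], faults: Sequence[Fault]) -> List[float]:
    """Return a corrupted copy; the fault-free weights stay untouched."""
    faulty = list(weights)
    for index, bit in faults:
        faulty[index] = flip_bit(faulty[index], bit)
    return faulty


def predict(weights: Sequence[float], x: Sequence[float]) -> int:
    """Toy linear classifier standing in for an NN."""
    return int(sum(w * xi for w, xi in zip(weights, x)) > 0)


def sde_rate(weights, faults, inputs) -> float:
    """Illustrative KPI: fraction of inputs whose prediction silently changes."""
    faulty = inject(weights, faults)
    changed = sum(predict(weights, x) != predict(faulty, x) for x in inputs)
    return changed / len(inputs)


weights = [0.5, -0.25, 1.0]
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
masked = sde_rate(weights, [(0, 0)], inputs)                       # low mantissa bit
critical = sde_rate(weights, [(0, 31), (1, 31), (2, 31)], inputs)  # all sign bits
```

The two hand-picked fault sets illustrate the distinction drawn in the abstract: a flip in the lowest mantissa bit is tolerated by the model, while sign-bit flips change every prediction without raising any error.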

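The paper introduces PyTorchALFI, a fault injection framework for testing neural network models, and covers the definition of test scenarios, the software architecture, and the analysis of test results.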

Large-Scale Application of Fault Injection into PyTorch Models -- an Extension to PyTorchFI for Validation Efficiency

Existing evaluation suites for multi-agent reinforcement learning (MARL) do
not assess generalization to novel situations as their primary objective
(unlike supervised-learning benchmarks). Our contribution, Melting Pot, is a
MARL evaluation suite that fills this gap, and uses reinforcement learning to
reduce the human labor required to create novel test scenarios. This works
because one agent's behavior constitutes (part of) another agent's environment.
To demonstrate scalability, we have created over 80 unique test scenarios
covering a broad range of research topics such as social dilemmas, reciprocity,
resource sharing, and task partitioning. We apply these test scenarios to
standard MARL training algorithms, and demonstrate how Melting Pot reveals
weaknesses not apparent from training performance alone.
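
The observation that one agent's behavior forms part of another agent's environment can be illustrated with a toy example. The sketch below is not the Melting Pot API: the iterated prisoner's dilemma, the payoff values, and the hand-scripted background policies are assumptions chosen to keep the example deterministic (the abstract describes using reinforcement learning to reduce the labor of creating scenarios, which is not reproduced here).

```python
from typing import Callable, Dict, List

Policy = Callable[[List[str], List[str]], str]  # (own history, other's history) -> action

# Row player's and column player's payoffs for cooperate ("C") / defect ("D").
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def always_cooperate(own: List[str], other: List[str]) -> str:
    return "C"


def always_defect(own: List[str], other: List[str]) -> str:
    return "D"


def tit_for_tat(own: List[str], other: List[str]) -> str:
    # Cooperate first, then copy the co-player's previous action.
    return other[-1] if other else "C"


def focal_return(focal: Policy, background: Policy, rounds: int = 10) -> float:
    """Average per-round payoff of the focal agent against one background agent."""
    focal_history: List[str] = []
    background_history: List[str] = []
    total = 0
    for _ in range(rounds):
        a = focal(focal_history, background_history)
        b = background(background_history, focal_history)
        total += PAYOFF[(a, b)][0]
        focal_history.append(a)
        background_history.append(b)
    return total / rounds


# Each "scenario" pairs the same game with a different background co-player.
scenarios: Dict[str, Policy] = {
    "cooperative partner": always_cooperate,  # resembles the training partner
    "exploiter": always_defect,               # held-out co-player
    "reciprocator": tit_for_tat,              # held-out co-player
}
scores = {name: focal_return(always_cooperate, bg) for name, bg in scenarios.items()}
```

A focal policy that only ever met the cooperative partner looks strong on that pairing, yet its return drops to zero against the exploiter, which is the kind of weakness that training performance alone would not show.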

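This paper proposes Melting Pot, a MARL evaluation suite that assesses generalization to novel situations and uses reinforcement learning to reduce the human labor needed to develop new test scenarios. The suite comprises more than 80 test scenarios covering a broad range of research topics, including social dilemmas, reciprocity, resource sharing, and task partitioning. Applying these scenarios to standard MARL training algorithms reveals weaknesses that are not apparent from training performance alone.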