Members of the Human-Robot Interaction (HRI) and Artificial Intelligence (AI)
communities have proposed Large Language Models (LLMs) as a promising resource
for robotics tasks such as natural language interactions, doing household and
workplace tasks, approximating `common sense reasoning', and modeling humans.
However, recent research has raised concerns about the potential for LLMs to
produce discriminatory outcomes and unsafe behaviors in real-world robot
experiments and applications. To address these concerns, we conduct an
HRI-based evaluation of discrimination and safety criteria on several
highly-rated LLMs. Our evaluation reveals that LLMs currently lack robustness
when encountering people across a diverse range of protected identity
characteristics (e.g., race, gender, disability status, nationality, religion,
and their intersections), producing biased outputs consistent with directly
discriminatory outcomes -- e.g. `gypsy' and `mute' people are labeled
untrustworthy, but not `european' or `able-bodied' people. Furthermore, we test
models in settings with unconstrained natural language (open vocabulary)
inputs, and find they fail to act safely, generating responses that accept
dangerous, violent, or unlawful instructions -- such as incident-causing
misstatements, taking people's mobility aids, and sexual predation. Our results
underscore the urgent need for systematic, routine, and comprehensive risk
assessments and assurances to improve outcomes and ensure LLMs only operate on
robots when it is safe, effective, and just to do so. Data and code will be
made available.

人机交互 (HRI) 和人工智能 (AI) 社区提出了大型语言模型（LLMs）作为机器人任务的一个有前景的资源，然而最近的研究引发了对 LLMs 在真实世界机器人实验和应用中产生歧视性结果和不安全行为的担忧。为了解决这些问题，我们在几个高评级的 LLMs 上进行了基于 HRI 的歧视和安全评估，发现它们在遇到具有多样性的受保护身份特征（例如种族、性别、残疾状况、国籍、宗教和交叉特征）的人时，产生了与直接歧视结果一致的偏见输出；此外，我们在自由语言输入环境中测试模型，发现它们不能安全行动，生成的回应接受有危险、暴力或非法指令，例如引发事故的错误陈述、夺取人们的移动辅助设备和性侵行为。我们的结果强调了迫切需要系统、常规和全面的风险评估和保证，以改善结果，并确保 LLMs 只在安全、有效和公正的情况下在机器人上运行。数据和代码将提供。

LLM 驱动的机器人存在歧视、暴力和非法行为风险

LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful  Actions

Public LLMs such as the Llama 2-Chat have driven huge activity in LLM
research. These models underwent alignment training and were considered safe.
Recently Qi et al. (2023) reported that even benign fine-tuning (e.g., on
seemingly safe datasets) can give rise to unsafe behaviors in the models. The
current paper is about methods and best practices to mitigate such loss of
alignment. Through extensive experiments on several chat models (Meta's Llama
2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo),
this paper uncovers that the prompt templates used during fine-tuning and
inference play a crucial role in preserving safety alignment, and proposes the
"Pure Tuning, Safe Testing" (PTST) principle -- fine-tune models without a
safety prompt, but include it at test time. Fine-tuning experiments on GSM8K,
ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of
unsafe behaviors, and even almost eliminates them in some cases.

本文研究了如何减轻模型由于微调引起的安全问题，通过对几个聊天模型进行广泛实验，发现在微调和推理过程中使用的提示模板对于保持安全对齐至关重要，并提出了 “纯微调，安全测试”（PTST）原则，即在没有安全提示的情况下微调模型，但在测试时使用它。在 GSM8K，ChatDoctor 和 OpenOrca 上进行的微调实验表明，PTST 显著减少了不安全行为的发生，甚至在某些情况下几乎消除了它们。