In this study, we tackle a growing concern around the safety and ethical use
of large language models (LLMs). Despite their potential, these models can be
tricked into producing harmful or unethical content through various
sophisticated methods, including 'jailbreaking' techniques and targeted
manipulation. Our work zeroes in on a specific issue: to what extent can LLMs
be led astray when asked to generate instruction-centric responses, such as
pseudocode, a program, or a software snippet, rather than vanilla text? To
investigate this question, we introduce TechHazardQA, a dataset
containing complex queries that should be answered in both text and
instruction-centric formats (e.g., pseudocode), aimed at identifying triggers
for unethical responses. We query a series of LLMs -- Llama-2-13b, Llama-2-7b,
Mistral-V2 and Mistral 8X7B -- and ask them to generate both text and
instruction-centric responses. For evaluation, we report the harmfulness score
as well as judgements from GPT-4 and humans. Overall, we observe that
asking LLMs to produce instruction-centric responses increases unethical
response generation by ~2-38% across the models. As an additional objective, we
investigate the impact of model editing using the ROME technique, which further
increases the propensity for generating undesirable content. In particular,
asking edited LLMs to generate instruction-centric responses further increases
unethical response generation by ~3-16% across the different models.
