This paper introduces fourteen novel datasets for evaluating the safety of
Large Language Models in the context of enterprise tasks. We devise a method
to evaluate a model's safety, defined by its ability to follow instructions
and output factual, unbiased, grounded, and appropriate content. We use OpenAI
GPT as a point of comparison since it excels at all levels of safety. Among
the smaller open-source models, Meta Llama2
performs well on factuality and toxicity but has the highest propensity for
hallucination. Mistral hallucinates the least but handles toxicity poorly; it
also performs well on a dataset mixing several tasks and safety vectors in a
narrow vertical domain. Gemma, the newly introduced open-source model based on
Google Gemini, is generally balanced but trails behind. When engaging in
back-and-forth conversation (multi-turn prompts), we find that the safety of
open-source models degrades significantly. Aside from OpenAI's GPT, Mistral is
the only model that still performs well in multi-turn tests.
