We propose a framework for evaluating strategic deception in large language
models (LLMs). In this framework, an LLM acts as a game master in two
scenarios: one with random game mechanics and another where it can choose
between random and deliberate actions. As an example, we use blackjack because
neither its action space nor its strategies involve deception. We benchmark Llama3-70B,
GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected
distributions in fair play to determine if LLMs develop strategies favoring the
"house." Our findings reveal that the LLMs exhibit significant deviations from
fair play when given implicit randomness instructions, suggesting a tendency
towards strategic manipulation in ambiguous scenarios. However, when presented
with an explicit choice, the LLMs largely adhere to fair play, indicating that
the framing of instructions plays a crucial role in eliciting or mitigating
potentially deceptive behaviors in AI systems.
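
The comparison against "expected distributions in fair play" can be illustrated with a goodness-of-fit check. The abstract does not say which statistical test the authors use, so the following is only a minimal sketch under stated assumptions: cards are drawn with replacement so that each of the 13 ranks is equally likely, the test is a chi-square test at the 5% level, and the function names are hypothetical.

```python
from collections import Counter

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]

# Chi-square critical value for 12 degrees of freedom at alpha = 0.05.
CHI2_CRIT_DF12_ALPHA05 = 21.026


def chi_square_statistic(dealt_cards):
    """Chi-square statistic of observed rank counts against a uniform deal.

    Assumption: every rank has probability 1/13 (draws with replacement).
    """
    counts = Counter(dealt_cards)
    expected = len(dealt_cards) / len(RANKS)
    return sum((counts.get(rank, 0) - expected) ** 2 / expected for rank in RANKS)


def deviates_from_fair_play(dealt_cards):
    """True if the dealt ranks differ significantly from a uniform deal."""
    return chi_square_statistic(dealt_cards) > CHI2_CRIT_DF12_ALPHA05


# Hypothetical logs of ranks dealt by an LLM game master.
fair_log = RANKS * 10                   # every rank appears equally often
biased_log = ["10"] * 100 + RANKS * 2   # heavily skewed towards tens

print(deviates_from_fair_play(fair_log))
print(deviates_from_fair_play(biased_log))
```

A real evaluation would also need to account for finite decks and for outcome-level statistics such as dealer bust rates, which this sketch leaves out.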

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

Automated sentiment analysis using large language models (LLMs) such as
ChatGPT, Gemini, or LLaMA2 is becoming widespread, both in academic research
and in industrial applications. However, assessment and validation of their
performance on ambiguous or ironic text remain poor. In this study, we
constructed nuanced and ambiguous scenarios, translated them into 10 languages,
and predicted their associated sentiment using popular LLMs. The results are
validated against post-hoc human responses. ChatGPT and Gemini often cope well
with ambiguous scenarios, but we find significant biases and inconsistent
performance across models and across the human languages evaluated. This work
provides a standardised methodology for evaluating automated sentiment analysis
and calls for further improvement of the algorithms and their underlying data
to increase their performance, interpretability, and applicability.

ChatGPT vs Gemini vs LLaMA on Multilingual Sentiment Analysis

This paper presents a case study on the design, administration,
post-processing, and evaluation of surveys on large language models (LLMs). It
comprises two components: (1) A statistical method for eliciting beliefs
encoded in LLMs. We introduce statistical measures and evaluation metrics that
quantify the probability of an LLM "making a choice", the associated
uncertainty, and the consistency of that choice. (2) We apply this method to
study what moral beliefs are encoded in different LLMs, especially in ambiguous
cases where the right choice is not obvious. We design a large-scale survey
comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white
lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a
pedestrian on the road?"). Each scenario includes a description, two possible
actions, and auxiliary labels indicating violated rules (e.g., "do not kill").
We administer the survey to 28 open- and closed-source LLMs. We find that (a)
in unambiguous scenarios, most models "choose" actions that align with
commonsense. In ambiguous cases, most models express uncertainty. (b) Some
models are uncertain about choosing the commonsense action because their
responses are sensitive to the question-wording. (c) Some models reflect clear
preferences in ambiguous scenarios. Specifically, closed-source models tend to
agree with each other.
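
The three quantities named in component (1), the probability of a choice, its uncertainty, and its consistency, can be illustrated with simple estimators over sampled answers. The abstract does not give the paper's definitions, so everything below is an assumption made for illustration: responses are already mapped to action labels "A" or "B", uncertainty is taken as binary entropy, consistency is the share of question wordings whose majority answer matches the pooled majority, and all names are hypothetical.

```python
import math
from collections import Counter


def action_probability(responses, action):
    """Fraction of sampled responses that selected `action`."""
    return Counter(responses)[action] / len(responses)


def binary_entropy(p):
    """Entropy in bits of a two-action choice with probability p for one action."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def majority(responses):
    """Most frequent action label in a list of responses."""
    return Counter(responses).most_common(1)[0][0]


def consistency(responses_by_wording):
    """Share of wordings whose majority choice equals the pooled majority.

    This definition is an illustrative assumption, not the paper's metric.
    """
    pooled = [r for rs in responses_by_wording.values() for r in rs]
    overall = majority(pooled)
    agree = sum(majority(rs) == overall for rs in responses_by_wording.values())
    return agree / len(responses_by_wording)


# Hypothetical samples for one scenario under three question wordings.
samples = {
    "a_or_b": ["A", "A", "A", "B"],
    "compare": ["A", "A", "B", "A"],
    "repeat": ["B", "B", "A", "B"],
}
pooled = [r for rs in samples.values() for r in rs]
p_a = action_probability(pooled, "A")
print(p_a, binary_entropy(p_a), consistency(samples))
```

In this toy data the model leans towards action "A" overall but flips under one wording, which is the kind of sensitivity to question-wording that finding (b) describes.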

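Through a case study of surveys on large language models, this paper introduces
a statistical method for eliciting the beliefs encoded in LLMs and applies it
to study the moral beliefs encoded in different models, especially in ambiguous
cases where the right choice is not obvious. The study designs a large-scale
survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a
white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a
pedestrian on the road?"), and administers it to 28 open- and closed-source
LLMs. In unambiguous scenarios, most models choose actions that align with
commonsense, whereas in ambiguous cases most models express uncertainty. Some
models are highly sensitive to how the question is worded, and some reflect
clear preferences in ambiguous scenarios, with closed-source models in
particular tending to agree with each other.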