The versatility of Large Language Models (LLMs) on natural language understanding tasks has made them popular for research in social sciences. In particular, to properly understand the properties and innate personas of LLMs, researchers have performed studies that involve using prompts in the form of questions that ask LLMs of particular opinions. In this study, we take a cautionary step back and examine whether the current format of prompting enables LLMs to provide responses in a consistent and robust manner. We first construct a dataset that contains 693 questions encompassing 39 different instruments of persona measurement on 115 persona axes. Additionally, we design a set of prompts containing minor variations and examine LLM's capabilities to generate accurate answers, as well as consistency variations to examine their consistency towards simple perturbations such as switching the option order. Our experiments on 15 different open-source LLMs reveal that even simple perturbations are sufficient to significantly downgrade a model's question-answering ability, and that most LLMs have low negation consistency. Our results suggest that the currently widespread practice of prompting is insufficient to accurately capture model perceptions, and we discuss potential alternatives to improve such issues.

大型语言模型（LLMs）在社会科学研究中的自然语言理解任务的通用性使其备受青睐。本研究探讨了当前的提示格式是否能使LLMs以一致且稳健的方式提供回答，结论发现即使对选项顺序进行简单扰动也足以显著降低模型的问答能力，且大多数LLMs在否定一致性方面表现低下，提示目前的普遍做法无法准确捕捉模型的认知，我们讨论了改进这些问题的可能替代方案。

You don't need a personality test to know these models are unreliable:
  Assessing the Reliability of Large Language Models on Psychometric
  Instruments

评估大型语言模型在心理测量工具上的可靠性