Instruction-tuned Large Language Models (LLMs) have exhibited impressive
language understanding and the capacity to generate responses that follow
specific instructions. However, due to the computational demands associated
with training these models, their applications often rely on zero-shot
settings. In this paper, we evaluate the zero-shot performance of two publicly
accessible LLMs, ChatGPT and OpenAssistant, in the context of Computational
Social Science classification tasks, while also investigating the effects of
various prompting strategies. Our experiments consider the impact of prompt
complexity, including the effects of incorporating label definitions into the
prompt, using synonyms for label names, and integrating past memories during
foundation model training. The findings indicate that in a
zero-shot setting, current LLMs are unable to match the performance of
smaller, fine-tuned baseline transformer models (such as BERT). Additionally,
we find that different prompting strategies can significantly affect
classification accuracy, with variations in accuracy and F1 scores exceeding
10%.
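
The prompt variations described above (adding label definitions, swapping label names for synonyms) can be illustrated with a small sketch. The abstract does not give the actual prompts, so the template wording, the function names `build_prompt` and `map_to_label`, and the example labels are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def build_prompt(
    text: str,
    labels: List[str],
    definitions: Optional[Dict[str, str]] = None,
    synonyms: Optional[Dict[str, str]] = None,
) -> str:
    """Assemble a zero-shot classification prompt (hypothetical template)."""
    # Optionally show a synonym in place of each original label name.
    shown = [synonyms.get(lab, lab) if synonyms else lab for lab in labels]
    lines = ["Classify the text into one of: " + ", ".join(shown) + "."]
    # Optionally add one definition line per label that has a definition.
    if definitions:
        for lab, name in zip(labels, shown):
            if lab in definitions:
                lines.append(f"{name}: {definitions[lab]}")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


def map_to_label(
    answer: str,
    labels: List[str],
    synonyms: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Map a model answer back to an original label, or None if no exact match."""
    cleaned = answer.strip().strip(".").lower()
    for lab in labels:
        name = synonyms.get(lab, lab) if synonyms else lab
        if cleaned == name.lower():
            return lab
    return None


if __name__ == "__main__":
    labels = ["complaint", "praise"]  # hypothetical label set
    print(build_prompt("The service was slow.", labels))
    print(build_prompt(
        "The service was slow.",
        labels,
        definitions={"complaint": "expresses dissatisfaction"},
        synonyms={"praise": "compliment"},
    ))
```

Mapping answers back to the original label names keeps scores comparable across prompt variants, since every variant is then evaluated against the same gold labels.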

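This paper evaluates the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, on Computational Social Science classification tasks and studies the effects of various prompting strategies. In a zero-shot setting, current LLMs cannot match the performance of smaller, fine-tuned baseline transformer models (such as BERT). Different prompting strategies can also significantly affect classification accuracy, with differences in accuracy and F1 scores exceeding 10%.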

Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science

"Theory of Mind" (ToM) is the ability to understand human thinking and
decision-making, an ability that plays a crucial role in many types of social
interaction between people, including linguistic communication. This paper
investigates to what extent recent Large Language Models in the ChatGPT
tradition possess ToM. Focussing on six well-known ToM problems, we posed each
problem to two versions of ChatGPT and compared the results under a range of
prompting strategies. While the results concerning ChatGPT-3 were somewhat
inconclusive, ChatGPT-4 was shown to arrive at the correct answers more often
than would be expected based on chance, although correct answers were often
arrived at on the basis of false assumptions or invalid reasoning.
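
The abstract does not say how the comparison with chance was made. One common way to check whether a number of correct answers exceeds chance is a one-sided binomial tail probability, sketched below. The function name `binomial_tail` and the counts in the usage example are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical counts: 10 correct answers out of 12 two-option questions,
    # where guessing would succeed with probability 0.5.
    print(binomial_tail(10, 12, 0.5))
```

A small tail probability suggests the observed accuracy is unlikely under pure guessing. It says nothing about whether the reasoning behind each answer was valid, which the abstract treats as a separate question.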

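This paper investigates to what extent recent Large Language Models in the ChatGPT tradition possess Theory of Mind, the ability to understand human thinking and decision-making. On six well-known ToM problems, results for ChatGPT-3 were somewhat inconclusive, while ChatGPT-4 reached correct answers more often than chance would predict, although those answers often rested on false assumptions or invalid reasoning.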