We present a new challenge to examine whether large language models understand social norms. In contrast to existing datasets, our dataset requires a fundamental understanding of social norms to solve. It features the largest set of social norm skills to date, consisting of 402 skills and 12,383 questions that cover a wide range of social norms, from opinions and arguments to culture and laws. We design our dataset according to the K-12 curriculum, which enables a direct comparison between the social understanding of large language models and that of humans, more specifically, elementary students. While prior work achieves near-random accuracy on our benchmark, recent large language models such as GPT3.5-Turbo and LLaMA2-Chat perform significantly better, only slightly below human performance. We then propose a multi-agent framework based on large language models to improve the models' ability to understand social norms. This method further improves large language models, bringing them on par with humans. Given the increasing adoption of large language models in real-world applications, our finding is particularly important and presents a unique direction for future improvements.

Measuring Social Norms of Large Language Models