Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks. While LLMs are increasingly deployed in many forms, including conversational agents that interact with humans, we lack a grounded benchmark to measure how well LLMs understand \textit{social} language. Here, we introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge, which we group into five categories: humor & sarcasm, offensiveness, sentiment & emotion, social factors, and trustworthiness. In tests on the benchmark, we demonstrate that current models attain only moderate performance but reveal significant potential for task transfer among different types and categories of tasks, which was predicted from theory. Through zero-shot evaluations, we show that pretrained models already possess some innate but limited capabilities of social language understanding, and that training on one category of tasks can improve zero-shot testing on others. Our benchmark provides a systematic way to analyze model performance on an important dimension of language and points to clear room for improvement in building more socially aware LLMs. The associated resources are released at https://github.com/minjechoi/SOCKET.

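This paper introduces SocKET, a new theory-driven benchmark for testing how well large language models understand social language. The results show that current models attain only moderate performance, but that there is potential for task transfer across different types and categories of tasks. Zero-shot evaluations further reveal that pretrained models already possess some innate capability for social language understanding. The benchmark provides a systematic way to analyze model performance on an important dimension of language and offers guidance for building more socially aware large language models.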

Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with the SocKET Benchmark