Language models (LMs) are increasingly used as simulacra for people, yet their ability to match the distribution of views of a specific demographic group and be \textit{distributionally aligned} remains uncertain. This notion of distributional alignment is complex, as there is significant variation in the types of attributes that are simulated. Prior work has underexplored the role of three critical variables -- the question domain, steering method, and distribution expression method -- which motivates our contribution of a benchmark explicitly addressing these dimensions. We construct a dataset expanding beyond political values, create human baselines for this task, and evaluate the extent to which an LM can align with a particular group's opinion distribution to inform design choices of such simulation systems. Our analysis reveals open problems regarding whether, and how, LMs can be used to simulate humans, and shows that LMs can describe a group's opinion distribution more accurately than they can simulate it.

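This work addresses the shortcomings of large language models in simulating the opinion distributions of specific populations, focusing on three underexplored variables: the question domain, the steering method, and the distribution expression method. We construct a dataset that extends beyond political values and establish human baselines. By evaluating how closely language models align with a particular group's opinion distribution, we reveal open problems in simulating humans and find that large language models describe opinion distributions more accurately than they simulate them.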

Benchmarking Distributional Alignment of Large Language Models