As scaling has dramatically enhanced the capabilities of Large Language Models (LLMs), there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle (helpful, honest, harmless), culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by combining automated generation with GPT-4 and human verification, and it covers six cultural dimensions across seven domains. Our comprehensive experiments provide insights into the cultural orientation of mainstream LLMs, highlighting both consistencies and variations across dimensions and domains. The findings underscore the importance of integrating cultural considerations into LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.

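Using automated generation with GPT-4 followed by human verification, we construct CDEval, a new benchmark for evaluating the cultural dimensions of LLMs. By studying the cultural aspects of mainstream LLMs, we draw several interesting conclusions that underscore the importance of integrating cultural considerations into LLM development, especially for applications in multicultural settings. Through CDEval, we aim to provide a more comprehensive framework for the future development and evaluation of LLMs, offer a valuable resource for cultural studies, and pave the way for more culturally aware and sensitive models.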

CDEval: A Benchmark for Evaluating the Cultural Dimensions of Large Language Models