The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of the generated code need to be systematically evaluated before it can be widely adopted. Although LLM-based code generation is increasingly prominent, comprehensive evaluations of LLM-generated statistical code remain scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study uses a set of statistical analysis tasks covering diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. Human experts assess the quality of the LLM-generated SAS code on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs are useful for generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers insights into the capabilities and limitations of LLMs in statistical programming and provides guidance for future advancements in AI-assisted coding systems for statistical analysis.

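This study examines the validity and quality of statistical analysis code generated by large language models (LLMs), addressing the lack of systematic evaluation in this area in the literature. By evaluating different versions of ChatGPT and Llama on SAS programming tasks, the study finds that although LLMs can generate syntactically correct code, they fall short in deep domain understanding and in the accuracy of results. The study provides guidance for future progress in AI-assisted programming systems for statistical analysis.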

Performance Evaluation of Large Language Models in Statistical Programming