In this paper, I describe methodological considerations for studies that aim to evaluate the cognitive capacities of large language models (LLMs) using language-based behavioral assessments. Drawing on three case studies from the literature (a commonsense knowledge benchmark, a theory of mind evaluation, and a test of syntactic agreement), I describe common pitfalls that might arise when applying a cognitive test to an LLM. I then list 10 do's and don'ts that should help researchers design high-quality cognitive evaluations for AI systems. I conclude by discussing four areas where the do's and don'ts are currently under active discussion: prompt sensitivity, cultural and linguistic diversity, using LLMs as research assistants, and running evaluations on open vs. closed LLMs. Overall, the goal of the paper is to contribute to the broader discussion of best practices in the rapidly growing field of AI Psychology.

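This paper describes methodological considerations for studies that use language-based behavioral assessments to evaluate the cognitive capacities of large language models (LLMs). Through three case studies (a commonsense knowledge benchmark, a theory of mind evaluation, and a test of syntactic agreement), the author describes common pitfalls that can arise when a cognitive test is applied to an LLM. The author also lists 10 do's and don'ts to help design high-quality cognitive evaluations for AI systems. The paper closes by discussing four areas that are currently under active discussion: prompt sensitivity, cultural and linguistic diversity, using LLMs as research assistants, and running evaluations on open vs. closed LLMs. Overall, the paper aims to contribute to best practices in the rapidly growing field of AI Psychology.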

Running Cognitive Evaluations on Large Language Models: The Do's and the Don'ts