The era of large language models (LLM) raises questions not only about how to train models, but also about how to evaluate them. Despite numerous existing benchmarks, insufficient attention is often given to creating assessments that test LLMs in a valid and reliable manner. To address this challenge, we accommodate the Evidence-centered design (ECD) methodology and propose a comprehensive approach to benchmark development based on rigorous psychometric principles. In this paper, we have made the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education, highlighting the limitations of existing benchmark development approach and taking into account the development of LLMs. We conclude that a new approach to benchmarking is required to match the growing complexity of AI applications in the educational context. We construct a novel benchmark guided by the Bloom's taxonomy and rigorously designed by a consortium of education experts trained in test development. Thus the current benchmark provides an academically robust and practical assessment tool tailored for LLMs, rather than human participants. Tested empirically on the GPT model in the Russian language, it evaluates model performance across varied task complexities, revealing critical gaps in current LLM capabilities. Our results indicate that while generative AI tools hold significant promise for education - potentially supporting tasks such as personalized tutoring, real-time feedback, and multilingual learning - their reliability as autonomous teachers' assistants right now remain rather limited, particularly in tasks requiring deeper cognitive engagement.

本研究解决了当前大型语言模型（LLM）评估方法的有效性和可靠性不足的问题。通过采用以证据为中心的设计（ECD）方法论，文章提出了一种基于心理测量学原理的全新基准开发方法，并强调现有基准的局限性。研究结果表明，尽管生成式人工智能工具在教育中展现出巨大潜力，但在需要更深层次认知参与的任务中，其作为自主教师助手的可靠性仍然有限。

基于心理测量学的新方法开发大型语言模型的职业能力基准