Effective educational measurement depends on curating well-designed item pools, that is, pools whose items have suitable psychometric properties. However, item calibration is time-consuming and costly because it requires responses from a sufficiently large number of respondents. We explore whether six LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus), used individually and in various combinations drawn by sampling methods, can produce responses with psychometric properties similar to those of human answers. Results show that some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents, because each has a narrow proficiency distribution, but an ensemble of LLMs better resembles the ability distribution of college students. Item parameters calibrated from LLM-Respondents correlate highly with their human-calibrated counterparts (e.g., > 0.8 for GPT-3.5) and closely resemble the parameters of the human subset (e.g., a Spearman correlation difference of 0.02). We evaluate several augmentation strategies for their relative performance; resampling methods prove most effective, raising the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
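To make the comparison metric concrete, the following is a minimal sketch of how a Spearman correlation between two sets of item parameters could be computed. It is an illustration under stated assumptions, not the calibration procedure used in the study: item difficulty is approximated here by the classical proportion-correct statistic rather than by a fitted IRT model, the response matrices are invented toy data, and the function names (`average_ranks`, `spearman`, `proportion_correct`) are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def proportion_correct(responses):
    """Per-item proportion correct (assumed stand-in for calibrated difficulty).

    responses: list of respondents, each a list of 0/1 item scores.
    """
    n = len(responses)
    return [sum(col) / n for col in zip(*responses)]


# Toy 0/1 response matrices (rows = respondents, columns = items); invented data.
human = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0], [1, 1, 0, 1]]
llm = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 1]]

rho = spearman(proportion_correct(human), proportion_correct(llm))
print(f"Spearman correlation of item difficulties: {rho:.3f}")
```

In a real analysis the two parameter vectors would come from a calibrated measurement model, and a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written rank correlation.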

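This work explores the effectiveness of educational measurement by using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) and combinations of them, via sampling methods, to produce responses with psychometric properties similar to those of human answers. The results show that some LLMs have comparable or higher proficiency in College Algebra than college students. Item parameters calibrated from LLM-Respondents correlate highly with their human-calibrated counterparts and closely resemble the parameters of the human subset. Of the several augmentation strategies evaluated, resampling methods prove most effective, raising the Spearman correlation from 0.89 (human data only) to 0.93 (augmented human data).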

Leveraging LLM-Respondents for Item Evaluation: A Psychometric Analysis