Large Language Model (LLM) evaluation is currently one of the most important
areas of research, with existing benchmarks proving to be insufficient and not
completely representative of LLMs' various capabilities. We present a curated
collection of challenging statements on sensitive topics for LLM benchmarking
called TruthEval. These statements were curated by hand and contain known truth
values. The categories were chosen to distinguish LLMs' abilities from their
stochastic nature. We perform initial analyses using this dataset and find
several instances of LLMs failing at simple tasks, showing their inability to
understand simple questions.

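By hand-curating TruthEval, an LLM benchmark of challenging statements on sensitive topics with known truth values, we provide a curated collection that distinguishes LLMs' abilities from their stochastic nature. Our initial analyses of this dataset find several cases in which LLMs fail at simple tasks, showing their inability to understand simple questions.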

TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability

Many existing benchmarks of large (multimodal) language models (LLMs) focus
on measuring LLMs' academic proficiency, often with an additional interest in
comparing model performance with that of human test takers. While these benchmarks have
proven key to the development of LLMs, they suffer from several limitations,
including questionable measurement quality (e.g., Do they measure what they are
supposed to in a reliable way?), lack of quality assessment at the item level
(e.g., Are some items more important or difficult than others?) and unclear
human population reference (e.g., To whom can the model be compared?). In
response to these challenges, we propose bringing knowledge from
psychometrics, a field dedicated to the measurement of latent variables like
academic proficiency, into LLM benchmarking. We make three primary
contributions. First, we introduce PATCH: a novel framework for
Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the aforementioned
limitations, presenting a new direction for LLM benchmark research. Second, we
implement PATCH by measuring GPT-4 and Gemini-Pro-Vision's proficiency in
8th-grade mathematics against 56 human populations. We show that adopting a
psychometrics-based approach yields evaluation outcomes that diverge from those
based on existing benchmarking practices. Third, we release four datasets to
support measuring and comparing LLM proficiency in grade school mathematics and
science against human populations.

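Drawing on knowledge from psychometrics, this work proposes PATCH, a new psychometrics-based benchmarking framework for large (multimodal) language models (LLMs). Using this framework, it measures the proficiency of GPT-4 and Gemini-Pro-Vision in 8th-grade mathematics and compares it against 56 human populations. It also releases four datasets for measuring and comparing LLM proficiency in grade school mathematics and science against human populations.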