Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.
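
To make the notions of risk score and miscalibration concrete, the following is a minimal sketch and not the folktexts API. It assumes a binary question whose two answer tokens receive probabilities `p_yes` and `p_no` from the model, and it uses equal-width binned expected calibration error as one possible calibration metric. The function names `risk_score` and `expected_calibration_error` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def risk_score(p_yes: float, p_no: float) -> float:
    """Renormalize the mass on the two answer tokens into a score in [0, 1].

    Assumption: the model exposes next-token probabilities for the
    positive and negative answer tokens of a binary question.
    """
    total = p_yes + p_no
    if total <= 0:
        raise ValueError("answer tokens received no probability mass")
    return p_yes / total


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE.

    For each bin, compare the mean predicted score with the observed
    fraction of positive labels, and average the gaps weighted by bin size.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and aligned")

    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # A score of exactly 1.0 falls into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    n = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_score - positive_rate)
    return ece


if __name__ == "__main__":
    # Toy data: over-confident scores pushed to the extremes.
    toy_scores = [1.0, 1.0, 0.0, 0.0]
    toy_labels = [1, 0, 0, 0]
    print(expected_calibration_error(toy_scores, toy_labels))
```

Under this sketch, scores that cluster near 0.5 regardless of outcome, and scores that pile up at 0 or 1 while outcomes remain mixed, both yield a large gap between mean score and observed positive rate, which is the kind of miscalibration the abstract describes for base and instruction-tuned models respectively.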

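This work addresses the shortcomings of existing question-answering benchmarks in evaluating language models' ability to quantify outcome uncertainty. We introduce the folktexts software package, which uses large language models to systematically generate risk scores and evaluates them on benchmark prediction tasks. We find that zero-shot risk scores carry high predictive signal but are widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores.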

Evaluating Language Models as Risk Scores