Large language models (LLMs) are increasingly essential to natural language processing, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora such as The Pile influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalence in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representations of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods do little to resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.

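This work introduces Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, focusing on the representation of disease prevalence across demographic groups, and shows how demographic biases embedded in pre-training text influence LLM outputs. The results show substantial misalignment between LLMs' representation of disease prevalence and actual prevalence across demographic groups, indicating a risk of bias propagation and a lack of real-world grounding.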

Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias