As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. To ensure that the perturbations are accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
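To make the idea of a demographic perturbation concrete, the following is a minimal illustrative sketch and not the procedure used to build DiversityMedQA. It assumes a naive word-level gender swap and a simple accuracy gap between original and perturbed questions. The swap table and the names `swap_gender`, `accuracy`, and `accuracy_gap` are hypothetical and introduced only for illustration.

```python
import re

# Hypothetical swap table (an assumption for illustration only).
# It is deliberately naive: for example, object-form "her" is mapped to "his".
GENDER_SWAPS = {
    "man": "woman", "woman": "man",
    "male": "female", "female": "male",
    "boy": "girl", "girl": "boy",
    "he": "she", "she": "he",
    "his": "her", "her": "his",
}

# Match whole words only, so "man" inside "woman" is never rewritten twice.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, GENDER_SWAPS)) + r")\b",
    re.IGNORECASE,
)


def swap_gender(question: str) -> str:
    """Return the question with gendered terms swapped, preserving capitalization."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = GENDER_SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return _PATTERN.sub(replace, question)


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that match the reference answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def accuracy_gap(original_preds: list[str], perturbed_preds: list[str],
                 answers: list[str]) -> float:
    """Accuracy on original questions minus accuracy on perturbed questions.

    This comparison is only meaningful for perturbations that leave the
    correct answer unchanged, which is an assumption of this sketch.
    """
    return accuracy(original_preds, answers) - accuracy(perturbed_preds, answers)


question = "A 45-year-old man presents with chest pain. He reports his symptoms began today."
print(swap_gender(question))
```

In this sketch, a positive `accuracy_gap` would indicate that a model answers the original questions correctly more often than their perturbed counterparts. A word-level swap like this one can change the clinically correct answer (for instance, in sex-specific conditions), which is the kind of case a validation or filtering step would need to catch.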

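This work addresses the potential for demographic bias in large language models used for medical diagnosis. We propose DiversityMedQA, a novel benchmark that perturbs medical exam questions to evaluate how model responses differ across patient populations. We find that model performance differs markedly across demographic conditions, and the benchmark provides a resource for evaluating and reducing demographic bias in medical diagnosis.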

DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis Using Large Language Models