In this research, we introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs). Building upon the BEATS framework, we present a bias benchmark for LLMs that measures performance across 29 distinct metrics. These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality-related misinformation risk. They enable a quantitative assessment of the extent to which LLM-generated responses may perpetuate societal prejudices that reinforce or expand systemic inequities. To achieve a high score on this benchmark, an LLM must demonstrate highly equitable behavior in its responses, making it a rigorous standard for responsible AI evaluation. Empirical results from our experiment show that 37.65\% of outputs generated by industry-leading models contained some form of bias, highlighting a substantial risk of using these models in critical decision-making systems. The BEATS framework and benchmark offer a scalable and statistically rigorous methodology to benchmark LLMs, diagnose the factors driving biases, and develop mitigation strategies. With the BEATS framework, our goal is to support the development of more socially responsible and ethically aligned AI models.

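This research proposes the BEATS framework for evaluating bias, ethics, fairness, and factuality in large language models, filling a gap in existing evaluation tools. By providing a bias benchmark with 29 sub-metrics, the study reveals the bias risks present in the outputs of mainstream industry models and points to potential problems with using these models in critical decision-making systems. The BEATS framework and benchmark provide a scalable and statistically rigorous method for evaluating large language models, with the aim of promoting the development of more socially responsible and ethically aligned AI models.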

BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models