Generative AI (GenAI) models have become vital across industries, yet current evaluation methods have not adapted to their widespread use. Traditional evaluations often rely on benchmarks and fixed datasets, which frequently fail to reflect real-world performance and leave a gap between lab-tested outcomes and practical applications. This white paper proposes a comprehensive framework for evaluating real-world GenAI systems, emphasizing diverse, evolving inputs and holistic, dynamic, and ongoing assessment approaches. The paper offers guidance for practitioners on how to design evaluation methods that accurately reflect real-time capabilities, and provides policymakers with recommendations for crafting GenAI policies focused on societal impacts rather than fixed performance numbers or parameter sizes. We advocate for holistic frameworks that integrate performance, fairness, and ethics, and for continuous, outcome-oriented methods that combine human and automated assessments while remaining transparent to foster trust among stakeholders. Implementing these strategies ensures GenAI models are not only technically proficient but also ethically responsible and impactful.

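This study addresses the problem that current evaluation methods for generative AI (GenAI) models have not adapted to real-world use. It proposes a comprehensive evaluation framework that emphasizes diverse inputs and ongoing assessment, significantly improving model performance in the real world, and ties this to policymakers' focus on societal impact. The findings indicate that implementing the framework ensures GenAI models are both technically capable and ethically responsible, with a positive impact.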

Evaluation Framework for AI Systems "in the Wild"