We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to the context given in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) the remaining responses are judged as accurate if they are fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.
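
The two-phase scoring described above can be illustrated with a minimal sketch. The abstract does not specify how verdicts are combined, so the rule used here is an assumption: a response counts as accurate for a judge only if that judge finds it both eligible (phase 1) and grounded (phase 2), each judge's score is the fraction of accurate responses, and the final score is the unweighted mean over judges. The names `Verdict`, `judge_score`, and `factuality_score` are hypothetical and not part of the benchmark's code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    """One judge's verdict on one response (hypothetical structure)."""
    eligible: bool  # phase 1: does the response fulfill the user request?
    grounded: bool  # phase 2: is the response fully grounded in the document?


def is_accurate(verdict: Verdict) -> bool:
    # A disqualified response is never counted as accurate,
    # regardless of its grounding verdict.
    return verdict.eligible and verdict.grounded


def judge_score(verdicts: list[Verdict]) -> float:
    """Fraction of responses one judge marks as accurate."""
    if not verdicts:
        raise ValueError("a judge needs at least one verdict")
    return sum(is_accurate(v) for v in verdicts) / len(verdicts)


def factuality_score(verdicts_by_judge: dict[str, list[Verdict]]) -> float:
    """Assumed aggregation: unweighted mean of the per-judge scores."""
    return mean(judge_score(v) for v in verdicts_by_judge.values())


# Illustrative verdicts only; these are not benchmark results.
example = {
    "judge_a": [Verdict(True, True), Verdict(True, False),
                Verdict(False, True), Verdict(True, True)],
    "judge_b": [Verdict(True, True), Verdict(True, True),
                Verdict(False, False), Verdict(True, True)],
}
print(factuality_score(example))
```

Averaging over several judges means that a single judge's systematic preference moves the final score by at most its share of the mean, which is one simple way to realize the bias mitigation the abstract mentions.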

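This work introduces FACTS Grounding, an online leaderboard and accompanying benchmark that evaluates how factually accurate a language model's generated text is with respect to the context given in the user prompt. By requiring long-form responses to be fully grounded in the provided document, the work presents a new evaluation approach and finds that the framework can effectively judge both the accuracy of a model's responses and its ability to fulfill the user request.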

FACTS Grounding Leaderboard: Evaluating the Response Accuracy of Large Language Models on Long-Form Input