We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in fields such as law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.

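GAIA is a benchmark for AI assistants that, if solved, would represent a milestone in AI research. GAIA poses real-world questions that require a set of fundamental abilities, such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA's questions are conceptually simple for humans yet challenging for most advanced AIs. The study shows that human respondents answer 92% of the questions correctly, versus only 15% for GPT-4 equipped with plugins. GAIA's philosophy departs from the current trend in AI benchmarks, which targets tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system exhibiting robustness similar to that of the average human on such questions. Using GAIA's methodology, we devised 466 questions and their answers. We release the questions while retaining the answers to 300 of them, to power a leaderboard available at https://huggingface.co/gaia-benchmark.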

GAIA: A Benchmark for General AI Assistants