The performance of Large Language Models (LLMs) on existing reasoning
benchmarks has improved considerably over the past few years. In response, we
present JEEBench, a considerably more challenging benchmark dataset for
evaluating the problem-solving abilities of LLMs. We curate 450 challenging
pre-engineering mathematics, physics and chemistry problems from the IIT
JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is
essential for solving problems in this benchmark. Our evaluation on the GPT
series of models reveals that although performance improves with newer models,
the best being GPT-4, the highest performance, even after using techniques like
Self-Consistency and Chain-of-Thought prompting, is less than 40 percent. Our
analysis demonstrates that errors in algebraic manipulation and failure to
retrieve relevant domain-specific concepts are primary contributors to GPT-4's
low performance. Given the challenging nature of the benchmark, we hope that it
can guide future research in problem solving using LLMs. Our code and dataset
are available here.

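This paper introduces JEEBench, a new benchmark dataset for evaluating the problem-solving abilities of Large Language Models, containing 450 challenging pre-engineering mathematics, physics and chemistry problems. It evaluates the GPT series of models and finds that even with techniques such as Self-Consistency and Chain-of-Thought prompting, the best performance, achieved by GPT-4, remains below 40 percent; errors in algebraic manipulation and a lack of relevant domain knowledge are the main causes of the poor performance. The authors hope that this benchmark can guide future research on problem solving with Large Language Models.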
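
For readers unfamiliar with Self-Consistency, the core idea is to sample several reasoning paths for the same problem and keep the most frequent final answer. The following is a minimal sketch of that voting step only, not the evaluation code used for JEEBench. The function name `self_consistent_answer`, the answer normalization (trimming whitespace and upper-casing), and the sample answers are all hypothetical and chosen for illustration.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer among sampled reasoning paths.

    `sampled_answers` is a list of final-answer strings, one per sampled
    chain of thought. Normalization here is an assumption for illustration.
    """
    # Normalize each answer and skip empty ones before counting votes.
    votes = Counter(
        answer.strip().upper()
        for answer in sampled_answers
        if answer and answer.strip()
    )
    if not votes:
        return None
    # most_common(1) returns a list with the single top (answer, count) pair.
    return votes.most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled responses.
samples = ["B", "b ", "C", "B", "A"]
print(self_consistent_answer(samples))
```

In this sketch, ties are resolved in favor of the answer that was seen first, which is a design choice of the example rather than something the abstract specifies.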