In this thesis, I evaluate the performance of Large Language Models (LLMs) on the Law School Admission Test (LSAT), specifically its Logic Games section. I focus on this section because it poses a complex logical reasoning task, making it a valuable source of data for evaluating how well modern, increasingly capable LLMs handle hard logical reasoning. I construct a dataset of LSAT logic games and their associated metadata, and extensively evaluate LLMs' performance in a Chain-of-Thought prompting setting. Given the weak performance in this setting, I explore other prompting frameworks on a smaller subset of the dataset, adapting ideas from Reflexion to this task. This yields substantially improved accuracy, 70 percent for GPT-4 and 46 percent for GPT-3.5 on this subset, highlighting the capacity of LLMs to revise their logical errors despite initially weak performance. Finally, I analyze which types of logic games models perform better or worse on, as well as the types of logical errors identified through human annotation, providing detailed insights into the logical reasoning capabilities of LLMs.

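This study evaluates the performance of large language models on the Law School Admission Test (LSAT), specifically on the Logic Games section. It constructs a dataset and explores different prompting frameworks, finding that an improved prompting method raises GPT-4's accuracy to 70%, which highlights the potential of large language models to correct their logical errors. The study also provides an in-depth analysis of model performance across different types of logic games.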

Lost in Logic: An Evaluation of Large Language Models' Reasoning Capabilities on LSAT Logic Games