We introduce a comprehensive Linguistic Benchmark designed to evaluate the
limitations of Large Language Models (LLMs) in domains such as logical
reasoning, spatial intelligence, and linguistic understanding, among others.
Through a series of straightforward questions, it uncovers significant
limitations in well-regarded models' ability to perform tasks that humans manage with
ease. It also highlights the potential of prompt engineering to mitigate some
errors and underscores the necessity for better training methodologies. Our
findings stress the importance of grounding LLMs in human reasoning and
common sense, emphasising the need for a human in the loop in enterprise
applications. We hope this work paves the way for future research to enhance
the usefulness and reliability of new models.
