We present a benchmark for assessing the ability of Large Language Models (LLMs) to discern intercardinal directions between geographic locations and apply it to three prominent LLMs: GPT-3.5, GPT-4, and Llama-2. The benchmark specifically evaluates whether LLMs exhibit a hierarchical spatial bias similar to that of humans, in which judgments about the spatial relationship between individual locations are influenced by the perceived relationship between the larger units that contain them. To investigate this, we formulated 14 questions about well-known American cities. Seven questions were designed to challenge the LLMs with scenarios potentially influenced by the orientation of larger geographical units, such as states or countries, while the remaining seven targeted locations less susceptible to such hierarchical categorization. Among the tested models, GPT-4 performed best with 55.3% accuracy, followed by GPT-3.5 at 47.3% and Llama-2 at 44.7%. All models were significantly less accurate on tasks with suspected hierarchical bias. For example, GPT-4's accuracy dropped to 32.9% on these tasks, compared to 85.7% on the others. Despite these inaccuracies, the models identified the nearest cardinal direction in most cases, which suggests that they learn spatial relations associatively and thereby reproduce human-like misconceptions. We discuss the potential of text-based data that represents geographic relationships directly to improve the spatial reasoning capabilities of LLMs.
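To make the task concrete, the following is a minimal sketch of how a ground-truth intercardinal direction could be derived from coordinates and how exact-match accuracy could be scored. It is an illustration under stated assumptions, not the benchmark's actual implementation: the function names, the sign-based direction rule, the exact-match scoring, and the Reno/San Diego pair (a classic illustration of hierarchical bias, not necessarily one of the 14 questions) are all introduced here for illustration.

```python
# Illustrative sketch only: the scoring rule and the example cities are
# assumptions, not the benchmark's actual implementation.

def intercardinal(lat_a, lon_a, lat_b, lon_b):
    """Return the intercardinal direction of A as seen from B.

    Uses the signs of the latitude and longitude differences. Assumes the two
    points differ in both coordinates and do not straddle the antimeridian.
    """
    ns = "north" if lat_a > lat_b else "south"
    ew = "east" if lon_a > lon_b else "west"
    return ns + ew


def accuracy(predictions, truths):
    """Fraction of predictions that exactly match the true direction."""
    matches = sum(p.strip().lower() == t for p, t in zip(predictions, truths))
    return matches / len(truths)


# Approximate coordinates in degrees; west longitudes are negative.
RENO = (39.53, -119.81)
SAN_DIEGO = (32.72, -117.16)

truth = intercardinal(*RENO, *SAN_DIEGO)
print(truth)  # northwest
```

Reno lies northwest of San Diego even though Nevada is commonly pictured as east of California, so an answer of "northeast" would get the north-south component right while following the state-level relationship for the east-west component.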

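We present a benchmark for evaluating the ability of large language models (LLMs) to judge intercardinal directions between geographic locations and apply it to three prominent LLMs: GPT-3.5, GPT-4, and Llama-2. In testing, GPT-4 performed best with 55.3% accuracy, followed by GPT-3.5 at 47.3% and Llama-2 at 44.7%. Although the models were less accurate on tasks with suspected hierarchical bias, they identified the nearest cardinal direction in most cases, showing human-like misconceptions. We discuss the potential of text data that directly represents geographic relationships to improve the spatial reasoning capabilities of LLMs.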

Distortions in Judged Spatial Relations in Large Language Models: The Dawn of Natural Language Geographic Data?