Large language models (LLMs) with powerful generalization ability have been
widely used in many domains. A systematic and reliable evaluation of LLMs is a
crucial step in their development and applications, especially for specific
professional fields. In the urban domain, there have been some early
explorations of the usability of LLMs, but a systematic and scalable
evaluation benchmark is still lacking. The challenge in constructing a
systematic evaluation benchmark for the urban domain lies in the diversity of
data and scenarios, as well as the complex and dynamic nature of cities. In
this paper, we propose CityBench, an interactive simulator-based evaluation
platform, as the first systematic evaluation benchmark for the capability of
LLMs in the urban domain. First, we build CitySim to integrate multi-source
data and simulate fine-grained urban dynamics. Based on CitySim, we design 7
tasks in 2 categories, perception-understanding and decision-making, to
evaluate the capability of LLMs as city-scale world models for the urban domain. Due
to the flexibility and ease of use of CitySim, our evaluation platform
CityBench can be easily extended to any city in the world. We evaluate 13
well-known LLMs, including open-source and commercial LLMs, in 13 cities
around the world. Extensive experiments demonstrate the scalability and
effectiveness of the proposed CityBench and shed light on the future development
of LLMs in the urban domain. The dataset, benchmark, and source code are openly
accessible to the research community via
this https URL

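In this paper, we propose CityBench as the first systematic benchmark for evaluating the capability of large language models in the urban domain. CitySim is built to integrate multi-source data and simulate fine-grained urban dynamics, and 7 tasks are designed to evaluate LLMs as city-scale world models in perception-understanding and decision-making. Extensive experiments with 13 well-known LLMs across 13 cities demonstrate the scalability and effectiveness of CityBench and offer insights for the future development of LLMs in the urban domain.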

CityBench: Evaluating the Capabilities of Large Language Model as World Model

To evaluate perception components of an automated driving system, it is
necessary to define the relevant objects. While the urban domain is popular
among perception datasets, relevance is insufficiently specified for this
domain. Therefore, this work adopts an existing method to define relevance in
the highway domain and expands it to the urban domain. While different
conceptualizations and definitions of relevance are present in the literature,
there is a lack of methods to validate these definitions. Therefore, this work
presents a novel relevance validation method leveraging a motion prediction
component. The validation leverages the idea that removing irrelevant objects
should not influence a prediction component which reflects human driving
behavior. The influence on the prediction is quantified by considering the
statistical distribution of prediction performance across a large-scale
dataset. The validation procedure is verified using criteria specifically
designed to exclude relevant objects. The validation method is successfully
applied to the relevance criteria from this work, thus supporting their
validity.

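By adopting an existing method and extending it to the urban domain, this work defines the relevant objects in perception datasets and presents a novel relevance validation method based on a motion prediction component. The influence on the prediction is quantified by considering the statistical distribution of prediction performance across a large-scale dataset, and the proposed relevance criteria are successfully validated.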
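
The validation idea in the abstract above (removing irrelevant objects should leave prediction performance essentially unchanged) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `removal_influence`, the use of a per-scene scalar prediction error, the `tolerance` threshold, and the choice of summary statistics are hypothetical and are not taken from the paper, which does not specify them in the abstract.

```python
from statistics import mean, median


def removal_influence(errors_all, errors_filtered, tolerance=0.1):
    """Summarize how prediction error changes when objects are removed.

    errors_all[i]:      assumed prediction error for scene i with all objects kept
    errors_filtered[i]: error for the same scene with "irrelevant" objects removed
    tolerance:          assumed threshold above which a scene counts as degraded
    """
    if len(errors_all) != len(errors_filtered) or not errors_all:
        raise ValueError("expected two non-empty lists with one error per scene")

    # Paired differences: positive values mean the prediction got worse.
    deltas = [after - before for before, after in zip(errors_all, errors_filtered)]
    degraded = sum(1 for d in deltas if d > tolerance)

    return {
        "mean_delta": mean(deltas),
        "median_delta": median(deltas),
        "degraded_fraction": degraded / len(deltas),
    }


if __name__ == "__main__":
    # Made-up numbers purely to show the call; not results from the paper.
    print(removal_influence([1.0, 2.0, 3.0], [1.0, 2.0, 3.5]))
```

Under these assumptions, a relevance criterion would be supported when the distribution of differences stays close to zero across the dataset, whereas a criterion that deliberately removes relevant objects should shift it noticeably.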