Large language models (LLMs) with powerful generalization ability have been
widely used in many domains. A systematic and reliable evaluation of LLMs is a
crucial step in their development and application, especially in specific
professional fields. In the urban domain, there have been some early
explorations of the usability of LLMs, but a systematic and scalable
evaluation benchmark is still lacking. The challenge in constructing a
systematic evaluation benchmark for the urban domain lies in the diversity of
data and scenarios, as well as the complex and dynamic nature of cities. In
this paper, we propose CityBench, an interactive simulator-based evaluation
platform, as the first systematic evaluation benchmark for the capability of
LLMs in the urban domain. First, we build CitySim to integrate multi-source
data and simulate fine-grained urban dynamics. Based on CitySim, we design 7
tasks in 2 categories, perception-understanding and decision-making, to
evaluate the capability of LLMs as city-scale world models for the urban domain. Due
to the flexibility and ease of use of CitySim, our evaluation platform
CityBench can be easily extended to any city in the world. We evaluate 13
well-known LLMs, including open-source and commercial LLMs, in 13 cities
around the world. Extensive experiments demonstrate the scalability and
effectiveness of the proposed CityBench and shed light on the future development
of LLMs in the urban domain. The dataset, benchmark, and source code are openly
accessible to the research community via
this https URL

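In this paper, we propose CityBench as the first systematic evaluation benchmark for assessing the capabilities of large language models in the urban domain. We build CitySim to integrate multi-source data and simulate fine-grained urban dynamics, and design 7 tasks to evaluate LLMs as city-scale world models for perception-understanding and decision-making. Extensive experiments with 13 well-known LLMs in 13 cities demonstrate the scalability and effectiveness of CityBench and offer insights for the future development of LLMs in the urban domain.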

CityBench: Evaluating the Capabilities of Large Language Model as World Model

Various machine learning approaches have gained significant popularity for
the automated classification of educational text to identify indicators of
learning engagement -- i.e. learning engagement classification (LEC). LEC can
offer comprehensive insights into human learning processes, attracting
significant interest from diverse research communities, including Natural
Language Processing (NLP), Learning Analytics, and Educational Data Mining.
Recently, Large Language Models (LLMs), such as ChatGPT, have demonstrated
remarkable performance in various NLP tasks. However, their comprehensive
evaluation and improvement approaches in LEC tasks have not been thoroughly
investigated. In this study, we propose the Annotation Guidelines-based
Knowledge Augmentation (AGKA) approach to improve LLMs. AGKA employs GPT 4.0 to
retrieve label definition knowledge from annotation guidelines, and then
applies the random under-sampler to select a few typical examples.
Subsequently, we conduct a systematic benchmark evaluation of LEC, which
includes six LEC datasets covering behavior classification (question and
urgency level), emotion classification (binary and epistemic emotion), and
cognition classification (opinion and cognitive presence). The study results
demonstrate that AGKA can enhance non-fine-tuned LLMs, particularly GPT 4.0 and
Llama 3 70B. GPT 4.0 with AGKA few-shot outperforms full-shot fine-tuned models
such as BERT and RoBERTa on simple binary classification datasets. However, GPT
4.0 lags in multi-class tasks that require a deep understanding of complex
semantic information. Notably, Llama 3 70B with AGKA is a promising combination
based on an open-source LLM, because its performance is on par with the closed-source
GPT 4.0 with AGKA. In addition, LLMs struggle to distinguish between labels
with similar names in multi-class classification.

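Using the Annotation Guidelines-based Knowledge Augmentation (AGKA) approach, we comprehensively evaluate large language models (LLMs) and improve them on learning engagement classification (LEC) tasks. AGKA employs GPT 4.0 to retrieve label definition knowledge from annotation guidelines and applies a random under-sampler to select a few typical examples; it is systematically evaluated on six LEC datasets. The results show that AGKA can enhance non-fine-tuned LLMs: GPT 4.0 with AGKA few-shot outperforms full-shot fine-tuned models such as BERT and RoBERTa on simple binary classification tasks, but lags in multi-class tasks that require a deep understanding of complex semantic information. Notably, Llama 3 70B, an open-source LLM, combined with AGKA shows great promise, with performance on par with closed-source GPT 4.0 with AGKA. However, LLMs struggle to distinguish between labels with similar names in multi-class classification.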
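
The two AGKA steps described in the abstract (label definitions taken from annotation guidelines, then random under-sampling to pick a few examples) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names `undersample_per_label` and `build_prompt`, the prompt wording, and the toy data are all hypothetical, and the under-sampling is done with the Python standard library rather than any specific sampler the paper may have used. The definition-retrieval step, which the abstract assigns to GPT 4.0, is replaced here by a hand-written dictionary.

```python
import random
from collections import defaultdict


def undersample_per_label(examples, k, seed=0):
    """Randomly keep at most k examples per label (random under-sampling).

    `examples` is a list of (text, label) pairs. Hypothetical helper.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    selected = []
    for label in sorted(by_label):  # sorted for a deterministic label order
        texts = by_label[label]
        picked = rng.sample(texts, min(k, len(texts)))
        selected.extend((text, label) for text in picked)
    return selected


def build_prompt(label_definitions, few_shot, query):
    """Assemble a few-shot prompt from label definitions and sampled examples.

    The prompt layout is an assumption made for illustration.
    """
    lines = ["Classify the text using one of the labels defined below.", ""]
    lines.append("Label definitions:")
    for label, definition in label_definitions.items():
        lines.append(f"- {label}: {definition}")
    lines.append("")
    lines.append("Examples:")
    for text, label in few_shot:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
    lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy data, invented for this sketch (not from any LEC dataset).
    definitions = {
        "question": "The post asks for information or help.",
        "non-question": "The post does not ask for anything.",
    }
    pool = [
        ("How do I submit the assignment?", "question"),
        ("When is the deadline?", "question"),
        ("Is the quiz graded?", "question"),
        ("Thanks, that was helpful.", "non-question"),
    ]
    shots = undersample_per_label(pool, k=2, seed=42)
    print(build_prompt(definitions, shots, "Where can I find week 3 slides?"))
```

Capping each label at `k` examples keeps the few-shot set balanced even when the labeled pool is skewed, which is the usual reason for under-sampling the majority class.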