Traffic Safety Description and Analysis in urban settings plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy by (1) employing bounding boxes for optimal visual data preprocessing, including video best-view selection and visual prompt engineering during both training and testing phases; (2) constructing concise Question-Answer sequences and designing textual prompts to refine instruction comprehension; (3) implementing block expansion to fine-tune large VLMs efficiently; and (4) advancing prediction accuracy via a unique sequential questioning-based prediction augmentation. Our method achieved a benchmark score of 33.4308, securing the leading position on the leaderboard. The code is available at https://github.com/alibaba/AICITY2024_Track2_AliOpenTrek_CityLLaVA

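Traffic safety description and analysis in urban scenarios plays a key role in applications such as insurance inspection and accident prevention. This paper introduces CityLLaVA, a new fine-tuning framework for visual language models tailored to urban scenarios. It employs bounding boxes for optimal visual data preprocessing, including video best-view selection and visual prompt engineering in the training and testing phases; constructs concise question-answer sequences and designs textual prompts to improve instruction comprehension; fine-tunes large visual language models efficiently through block expansion; and improves prediction accuracy through a unique sequential-questioning prediction augmentation method. In experiments, our method achieved a benchmark score of 33.4308, taking the leading position on the leaderboard.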
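To make the bounding-box-based best-view selection mentioned in the abstract more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each camera view comes with per-frame bounding boxes of the tracked target and that the "best" view is the one with the largest mean box area. The selection criterion, the `(x, y, width, height)` box layout, and the names `mean_box_area` and `select_best_view` are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# A box is (x, y, width, height) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def mean_box_area(boxes: List[Box]) -> float:
    """Average area of the boxes tracked in one camera view (0.0 if none)."""
    if not boxes:
        return 0.0
    return sum(w * h for _, _, w, h in boxes) / len(boxes)


def select_best_view(views: Dict[str, List[Box]]) -> str:
    """Return the name of the view whose target has the largest mean box area.

    Assumed criterion: a larger box suggests the target is closer to the
    camera and therefore easier for the model to describe.
    """
    if not views:
        raise ValueError("no views given")
    return max(views, key=lambda name: mean_box_area(views[name]))


# Hypothetical example data: three views of the same event.
views = {
    "camera_a": [(10, 10, 20, 40), (12, 11, 22, 42)],
    "camera_b": [(50, 60, 60, 120), (55, 62, 58, 118)],
    "camera_c": [],  # target never detected in this view
}
best = select_best_view(views)
print(best)
```

Other criteria, such as detection confidence or how long the target stays visible, could replace the area heuristic without changing the surrounding structure.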

CityLLaVA: Efficient Fine-Tuning of VLMs in Urban Scenarios