This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images that lack visual clues, and they include no accompanying reasoning. To address the data-quality issue, we devise a CLIP-based network to quantify how locatable a street-view image is, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are used to train GeoReasoner, which is fine-tuned through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations show that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
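
The data-curation step described above, keeping only highly locatable street views, can be illustrated with a minimal sketch. It is not the authors' implementation: the `StreetView` record, the `filter_locatable` helper, the 0.5 threshold, and the hard-coded scores are all hypothetical. In the actual pipeline the scores would come from the CLIP-based locatability network, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class StreetView:
    image_id: str
    # Hypothetical locatability score in [0, 1]; assumed to be produced
    # upstream by a CLIP-based scoring network (not implemented here).
    locatability: float


def filter_locatable(views, threshold=0.5):
    """Keep views scoring at or above the threshold, highest score first.

    The threshold value is an illustrative assumption, not a value
    taken from the paper.
    """
    kept = [v for v in views if v.locatability >= threshold]
    return sorted(kept, key=lambda v: v.locatability, reverse=True)


# Made-up example records, for illustration only.
views = [
    StreetView("tunnel_wall", 0.12),
    StreetView("street_with_signage", 0.91),
    StreetView("residential_block", 0.64),
]

selected = filter_locatable(views, threshold=0.5)
selected_ids = [v.image_id for v in selected]
print(selected_ids)
```

A fixed threshold is the simplest possible selection rule. A real pipeline might instead keep a top fraction of images per region to keep geographic coverage balanced.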

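This work addresses geo-localization with a new paradigm that uses a large vision-language model (LVLM) augmented with human inference knowledge. It uses a CLIP-based network to assess how locatable street-view images are and integrates external knowledge from real geo-localization games to train a model named GeoReasoner, which outperforms other LVLMs by more than 25%, surpasses StreetCLIP, and requires fewer training resources.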

GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model