Quantifying the gap between synthetic and real-world imagery is essential for improving both datasets and the transformer-based models that rely on large volumes of data, especially in underexplored domains such as aerial scene understanding, where the potential impact is significant. This paper introduces a novel methodology for scene complexity assessment using the Multi-Model Consensus Metric (MMCM) and depth-based structural metrics, enabling a robust evaluation of perceptual and structural disparities between domains. Our experimental analysis, which uses a real-world dataset (Dronescapes) and a synthetic dataset (Skyscenes), demonstrates that real-world scenes generally exhibit higher consensus among state-of-the-art vision transformers, while synthetic scenes show greater variability and challenge model adaptability. The results underline the inherent complexities and domain gaps, emphasizing the need for enhanced simulation fidelity and model generalization. This work provides insights into the interplay between domain characteristics and model performance, offering a pathway toward improved domain adaptation strategies in aerial scene understanding.

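This study examines the gap between synthetic and real imagery in aerial scene understanding and proposes a new method for assessing scene complexity based on the Multi-Model Consensus Metric (MMCM) and depth-based structural metrics. The experimental results show that real scenes achieve higher model consensus, whereas synthetic scenes exhibit greater variability, which underscores the need to improve simulation realism and model generalization.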

Quantifying the Synthetic and Real Domain Gap in Aerial Scene Understanding