Abstract
With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capabilities of these open-source or API-accessible models to achieve
open-world visual perception remains an open question. In this paper, we introduce