Recent multimodal large language models (MLLMs) are remarkable in
vision-language tasks, such as image captioning and question answering, but
lack the essential perception ability, i.e., object detection. In this work, we
address this limitation by introducing a novel research problem