The ability to learn novel concepts from context and deliver appropriate responses is essential in human conversation. Although current multimodal large language models (MLLMs) and large language models (LLMs) are trained on mega-scale datasets, recognizing unseen images or un