Rapid advances in multi-modal agents built on large foundation models have largely overlooked the agents' potential for language-based communication with one another in collaborative tasks. This oversight leaves a critical gap in understanding their effectiveness in real-world deployment.