With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception has attracted growing attention for its connection to the physical world and is making rapid progress. However, limited by existing datasets, previous works mainly focus on understanding obj