Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of the high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

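This study addresses an important gap in multimodal language analysis: the limited ability of existing multimodal large language models (MLLMs) to understand cognitive-level semantics. We propose the MMLA benchmark to evaluate and improve multimodal semantic understanding. Analyzing more than 61,000 multimodal utterances, we find that even optimized models reach only 60% to 70% accuracy, which indicates that current models remain limited in understanding complex human language. This work lays a foundation for further exploring the potential of large language models in multimodal language analysis and provides valuable resources.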

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark