The recent emergence of large language models (LLMs) has attracted considerable attention. LLMs interact with users through dialogue and generate responses that follow their instructions, which naturally requires dialogue comprehension abilities. Without correct comprehension of the dialogue, the model will inevitably generate incorrect responses. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to perform the evaluation with the help of the dialogue summarization task. Besides evaluating and analyzing dialogue summarization performance (DIAC-Sum), we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistencies. Even ChatGPT, the strongest evaluated model, has such errors in 16% of its summaries. On the more challenging task of answering the factual questions, the average accuracy of all evaluated LLMs is only 62.8%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject/object of the conversation is still the most challenging problem for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data. The experimental results demonstrate that our method achieves an accuracy improvement of 8.9% on DIAC-FactQA.
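The two headline numbers above (the share of summaries with factual inconsistencies and the accuracy on factual questions) are simple ratios. The following is a minimal sketch of how such ratios could be computed, not the authors' evaluation code: the record types `SummaryJudgment` and `FactQAItem`, their fields, and the case-insensitive exact-match scoring are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SummaryJudgment:
    """One generated summary with a consistency label (hypothetical schema)."""
    model: str
    is_consistent: bool  # True if the summary agrees with the source dialogue


@dataclass
class FactQAItem:
    """One factual question derived from a summary (hypothetical schema)."""
    model: str
    gold: str       # reference answer
    predicted: str  # answer produced by the evaluated model


def inconsistency_rate(judgments):
    """Fraction of summaries labeled as factually inconsistent."""
    if not judgments:
        raise ValueError("no judgments given")
    return sum(not j.is_consistent for j in judgments) / len(judgments)


def factqa_accuracy(items):
    """Fraction of questions answered correctly.

    Assumption: an answer counts as correct on a case-insensitive,
    whitespace-trimmed exact match with the reference answer.
    """
    if not items:
        raise ValueError("no items given")
    correct = sum(
        i.predicted.strip().lower() == i.gold.strip().lower() for i in items
    )
    return correct / len(items)
```

Calling `inconsistency_rate` on the labeled summaries of a single model gives a per-model figure; averaging those figures across models would give a cross-model mean in the spirit of the numbers reported above.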

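The recent emergence of large language models (LLMs) has attracted considerable attention. This work proposes evaluating dialogue comprehension through the dialogue summarization task, and derives factual questions from the generated summaries as a more flexible measure of dialogue comprehension. The evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistencies; even ChatGPT, the strongest model, produces errors in 16% of its summaries, and on the more challenging task of answering factual questions, the average accuracy of all evaluated LLMs is only 62.8%. Detailed analysis shows that understanding the subject/object of the conversation remains the most challenging problem in LLM dialogue comprehension. To stimulate and improve this ability, we propose a fine-tuning paradigm based on automatically constructed multi-task data; experiments show that our method improves accuracy on DIAC-FactQA by 8.9%.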

Exploring the Dialogue Comprehension Ability of Large Language Models