Large language models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence technology which is rapidly evolving and promises to aid in medical diagnosis. However, the correctness and the accuracy of their returns has not yet been properly evaluated. In this work, we propose an LLM evaluation paradigm that incorporates two independent steps of a novel methodology, namely (1) multimodal LLM evaluation via structured interactions and (2) follow-up, domain-specific analysis based on data extracted via the previous interactions. Using this paradigm, (1) we evaluate the correctness and accuracy of LLM-generated medical diagnosis with publicly available multimodal multiple-choice questions(MCQs) in the domain of Pathology and (2) proceed to a systemic and comprehensive analysis of extracted results. We used GPT-4-Vision-Preview as the LLM to respond to complex, medical questions consisting of both images and text, and we explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology. GPT-4-Vision-Preview performed quite well, scoring approximately 84\% of correct diagnoses. Next, we further analyzed the findings of our work, following an analytical approach which included Image Metadata Analysis, Named Entity Recognition and Knowledge Graphs. Weaknesses of GPT-4-Vision-Preview were revealed on specific knowledge paths, leading to a further understanding of its shortcomings in specific areas. Our methodology and findings are not limited to the use of GPT-4-Vision-Preview, but a similar approach can be followed to evaluate the usefulness and accuracy of other LLMs and, thus, improve their use with further optimization.

该研究提出了一种包括多步骤评估法的大型语言模型（LLM）评估范例，通过结构化的交互方式进行多模态LLM评估，并通过获取交互数据进行后续领域特定的分析，以提高其准确性和实用性。研究以GPT-4-Vision-Preview为LLM，使用多模态多项选择题评估其在病理学领域的医学诊断准确性，结果表明其约有84%的正确诊断，同时通过进一步的分析揭示了其在特定领域的不足之处。该方法和结果不仅适用于GPT-4-Vision-Preview，还可应用于评估其他LLMs的准确性和实用性，以进一步优化其应用。

评估基于LLM生成的医学图像和症状分析的多模态诊断