Recent advancements in multimodal techniques open exciting possibilities for
models excelling in diverse tasks involving text, audio, and image processing.
Models like GPT-4V, blending computer vision and language modeling, excel in
complex text and image tasks. Numerous prior research endeavors have diligently
examined the performance of these Vision Large Language Models (VLLMs) across
tasks like object detection, image captioning and others. However, these
analyses often focus on evaluating the performance of each modality in
isolation, lacking insights into their cross-modal interactions. Specifically,
questions concerning whether these vision-language models execute vision and
language tasks consistently or independently have remained unanswered. In this
study, we draw inspiration from recent investigations into multilingualism and
conduct a comprehensive analysis of model's cross-modal interactions. We
introduce a systematic framework that quantifies the capability disparities
between different modalities in the multi-modal setting and provide a set of
datasets designed for these evaluations. Our findings reveal that models like
GPT-4V tend to perform consistently modalities when the tasks are relatively
simple. However, the trustworthiness of results derived from the vision
modality diminishes as the tasks become more challenging. Expanding on our
findings, we introduce "Vision Description Prompting," a method that
effectively improves performance in challenging vision-related tasks.

通过对多模态机制的详细分析，揭示了 GPT-4V 等模型执行视觉和语言任务的一致性与独立性，并引入了一种名为 “Vision Description Prompting” 的方法，有效提高了具有挑战性的视觉相关任务的性能。

迷失在翻译中：当 GPT-4V (ision) 无法与文字心有灵犀。VLLMs 及更多的视觉语言一致性分析

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text.  A Vision-Language-Consistency Analysis of VLLMs and Beyond

Prosthetic Joint Infection (PJI) is a prevalent and severe complication
characterized by high diagnostic challenges. Currently, a unified diagnostic
standard incorporating both computed tomography (CT) images and numerical text
data for PJI remains unestablished, owing to the substantial noise in CT images
and the disparity in data volume between CT images and text data. This study
introduces a diagnostic method, HGT, based on deep learning and multimodal
techniques. It effectively merges features from CT scan images and patients'
numerical text data via a Unidirectional Selective Attention (USA) mechanism
and a graph convolutional network (GCN)-based feature fusion network. We
evaluated the proposed method on a custom-built multimodal PJI dataset,
assessing its performance through ablation experiments and interpretability
evaluations. Our method achieved an accuracy (ACC) of 91.4\% and an area under
the curve (AUC) of 95.9\%, outperforming recent multimodal approaches by 2.9\%
in ACC and 2.2\% in AUC, with a parameter count of only 68M. Notably, the
interpretability results highlighted our model's strong focus and localization
capabilities at lesion sites. This proposed method could provide clinicians
with additional diagnostic tools to enhance accuracy and efficiency in clinical
practice.

本研究提出了一种基于深度学习和多模态技术的诊断方法 HGT，通过单向选择性注意机制和基于图卷积网络（GCN）的特征融合网络，有效地将 CT 扫描图像和患者的数字文本数据特征融合。经过消融实验和可解释性评估，该方法在自定义的多模态 PJI 数据集上取得了 91.4％ 的准确率和 95.9％ 的曲线下面积（AUC），在 ACC 和 AUC 上分别优于最近的多模态方法，并仅使用了 68M 的参数计数。该方法可以为临床医生提供额外的诊断工具，以提高诊断精度和效率。

HGT: 使用 CT 图像和文本进行多模态假体周围关节感染诊断的分层 GCN 基础 Transformer

HGT: A Hierarchical GCN-Based Transformer for Multimodal Periprosthetic  Joint Infection Diagnosis Using CT Images and Text

As advances in large language models (LLMs) and multimodal techniques
continue to mature, the development of general-purpose multimodal large
language models (MLLMs) has surged, with significant applications in natural
image interpretation. However, the field of pathology has largely remained
untapped in this regard, despite the growing need for accurate, timely, and
personalized diagnostics. To bridge the gap in pathology MLLMs, we present the
PathAsst in this study, which is a generative foundation AI assistant to
revolutionize diagnostic and predictive analytics in pathology. To develop
PathAsst, we collect over 142K high-quality pathology image-text pairs from a
variety of reliable sources, including PubMed, comprehensive pathology
textbooks, reputable pathology websites, and private data annotated by
pathologists. Leveraging the advanced capabilities of ChatGPT/GPT-4, we
generate over 180K instruction-following samples. Furthermore, we devise
additional instruction-following data, specifically tailored for the invocation
of the pathology-specific models, allowing the PathAsst to effectively interact
with these models based on the input image and user intent, consequently
enhancing the model's diagnostic capabilities. Subsequently, our PathAsst is
trained based on Vicuna-13B language model in coordination with the CLIP vision
encoder. The results of PathAsst show the potential of harnessing the
AI-powered generative foundation model to improve pathology diagnosis and
treatment processes. We are committed to open-sourcing our meticulously curated
dataset, as well as a comprehensive toolkit designed to aid researchers in the
extensive collection and preprocessing of their own datasets. Resources can be
obtained at
this https URL

本文提出了 PathAsst，一种生成式 AI 助手，利用了 ChatGPT/GPT-4 和 Vicuna-13B 语言模型与 CLIP 视觉编码器，对 142K 高质量病理图像文本对进行了训练。结果表明，利用这种 AI 模型可以改善病理诊断和治疗过程。