Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding accurate diagnosis, effective education, and clinical intervention. However, VQA models cannot visually indicate the regions of interest that correspond to a given question, which leaves the surgical scene only partly understood. To tackle this, we propose surgical visual question localized-answering (VQLA), which provides precise, context-aware responses to specific queries about surgical images. Furthermore, to address the strong demand for safety in surgical scenarios and the potential for corruption during image acquisition and transmission, we propose a novel approach called Calibrated Co-Attention Gated Vision-Language (C$^2$G-ViL) embedding, which integrates and aligns multimodal information effectively. Additionally, we leverage an adversarial sample-based contrastive learning strategy to boost performance and robustness. We also extend our EndoVis-18-VQLA and EndoVis-17-VQLA datasets to broaden the scope and application of our data. Extensive experiments on these datasets demonstrate the strong performance and robustness of our solution, including its ability to withstand real-world image corruption. Our approach can thus serve as an effective tool for supporting surgical education and patient care and for improving surgical outcomes.

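This study addresses the inability of surgical visual question answering (VQA) models to accurately indicate the visual regions relevant to a given question. It proposes surgical visual question localized-answering (VQLA) to deliver precise, context-aware responses about surgical images. By introducing the Calibrated Co-Attention Gated Vision-Language (C$^2$G-ViL) embedding and an adversarial sample-based contrastive learning strategy, it significantly improves model robustness and performance, providing an effective tool for surgical education, patient care, and better surgical outcomes.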

Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question Localized-Answering in Robotic Surgery