In multi-modal frameworks, the alignment of cross-modal features presents a
significant challenge. The predominant approach in multi-modal pre-training
emphasizes either global or local alignment between modalities, utilizing
extensive datasets. This bottom-up driven method often suffers from a lack of
interpretability, a critical concern in radiology. Previous studies have
integrated high-level labels in medical images or text, but these still rely on
manual annotation, a costly and labor-intensive process. Our work introduces a
novel approach by using eye-gaze data, collected synchronously from radiologists
during diagnostic evaluations. This data, indicating radiologists' focus areas,
naturally links chest X-rays to diagnostic texts. We propose the Eye-gaze
Guided Multi-modal Alignment (EGMA) framework to harness eye-gaze data for
better alignment of image and text features, aiming to reduce reliance on
manual annotations and thus cut training costs. Our model demonstrates robust
performance, outperforming other state-of-the-art methods in zero-shot
classification and retrieval tasks. The incorporation of easily obtained
eye-gaze data during routine radiological diagnoses signifies a step towards
minimizing manual annotation dependency. Additionally, we explore the impact of
varying amounts of eye-gaze data on model performance, highlighting the
feasibility and utility of integrating this auxiliary data into multi-modal
pre-training.

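Eye-gaze data is used to assist the alignment of image and text features, reducing reliance on manual annotation and lowering training costs. The work also examines how varying amounts of eye-gaze data affect model performance, highlighting the feasibility and utility of integrating this auxiliary data into multi-modal pre-training.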

Eye-gaze Guided Multi-modal Alignment Framework for Radiology

Inspired by the human cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims to discover general representations
from large-scale data without requiring human annotation, which is an
expensive and time-consuming task. Its success in the fields of computer vision
and natural language processing has prompted its recent adoption into the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
suitable benchmarks to evaluate the power of SSL in the computer audition
domain. Finally, we discuss some open problems and point out future
directions for the development of audio SSL.

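This paper surveys the application of self-supervised learning in audio and speech processing, covering methods, empirical work, and benchmarks, and discusses open problems and future directions.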