Gaze estimation methods often experience significant performance degradation
when evaluated across different domains, due to the domain gap between the
testing and training data. Existing methods try to address this issue using
various domain generalization approaches, but with little success because gaze
datasets offer limited diversity in factors such as appearance, wearables, and
image quality. To overcome these limitations, we propose a novel framework called
CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its
transferable knowledge. Our framework is the first to apply a
vision-and-language cross-modal approach to the gaze estimation task.
Specifically, we extract gaze-relevant features by pushing them away from
gaze-irrelevant features, which can be flexibly constructed via language
descriptions. To learn more suitable prompts, we propose a personalized context
optimization method for text prompt tuning. Furthermore, we utilize the
relationship among gaze samples to refine the distribution of gaze-relevant
features, thereby improving the generalization capability of the gaze
estimation model. Extensive experiments demonstrate that CLIP-Gaze outperforms
existing methods on four cross-domain evaluations.

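This work proposes CLIP-Gaze, a novel framework that improves the generalization of gaze estimation by using a pre-trained vision-language model. The framework separates gaze-relevant features from gaze-irrelevant features constructed via language descriptions, applies a personalized context optimization method for text prompt tuning, and exploits the relationships among gaze samples to improve the generalization of the gaze estimation model. Results on four cross-domain evaluations show that CLIP-Gaze outperforms existing methods.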
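
The idea of pushing a gaze-relevant feature away from language-defined gaze-irrelevant features can be illustrated with a small sketch. The abstract does not specify the loss, so everything below is an assumption made for illustration: the function names `cosine_similarity` and `separation_loss`, the choice of mean absolute cosine similarity as the penalty, and the toy 3-dimensional vectors standing in for text embeddings of prompts describing appearance, wearables, or image quality.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_loss(
    gaze_feature: Sequence[float],
    irrelevant_features: Sequence[Sequence[float]],
) -> float:
    """Hypothetical penalty (not taken from the paper): the mean absolute
    cosine similarity between the gaze feature and each gaze-irrelevant
    feature. It is zero when the gaze feature is orthogonal to all of them."""
    sims = [abs(cosine_similarity(gaze_feature, f)) for f in irrelevant_features]
    return sum(sims) / len(sims)


# Toy stand-ins for embeddings of gaze-irrelevant text prompts
# (assumed values, chosen only to make the arithmetic easy to follow).
irrelevant = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

entangled = [1.0, 1.0, 0.0]   # overlaps with both irrelevant directions
separated = [0.0, 0.0, 1.0]   # orthogonal to both irrelevant directions

loss_entangled = separation_loss(entangled, irrelevant)
loss_separated = separation_loss(separated, irrelevant)
```

Under these assumptions, minimizing such a penalty during training would drive the gaze feature toward directions that carry no information about the described nuisance factors; the entangled toy feature receives a larger penalty than the separated one. In practice the irrelevant features would come from a text encoder rather than being hand-written.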