Over the past decade, visual gaze estimation has garnered growing attention
within the research community, owing to its wide range of applications. While
existing estimation approaches have achieved remarkable success in improving
prediction accuracy, they primarily infer gaze directions from single-image
signals and overlook the considerable potential of text guidance, which is now
widely used elsewhere. Notably, visual-language collaboration has been
extensively explored across a range of visual tasks, such as image synthesis
and manipulation, leveraging the remarkable transferability of the large-scale
Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing
gaze estimation approaches ignore the rich semantic cues conveyed by linguistic
signals and the priors in the CLIP feature space, which limits their
performance. To bridge this gap, we investigate text-eye collaboration and
introduce a novel gaze estimation framework, referred to as GazeCLIP.
Specifically, we design a
linguistic description generator to produce text signals with coarse
directional cues. We additionally present a CLIP-based backbone that excels at
characterizing text-eye pairs for gaze estimation, followed by a fine-grained
multi-modal fusion module that models the interrelationships between
heterogeneous inputs.
Extensive experiments on three challenging datasets demonstrate that the
proposed GazeCLIP surpasses previous approaches and achieves state-of-the-art
estimation accuracy.

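By designing GazeCLIP, a text-eye collaborative learning framework that combines text signals describing gaze direction with the strengths of the Contrastive Language-Image Pre-training (CLIP) model, this work achieves state-of-the-art gaze estimation accuracy and demonstrates its performance advantage on three challenging datasets.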
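
To make the idea of "text signals with coarse directional cues" concrete, the following is a minimal sketch of what a linguistic description generator might look like. It is not the GazeCLIP implementation: the function names (`coarse_direction`, `gaze_prompt`), the prompt template, the 10-degree threshold, and the sign convention (positive yaw is right, positive pitch is up) are all assumptions made for illustration, and the abstract does not specify how the coarse direction is obtained in the first place.

```python
def coarse_direction(yaw_deg: float, pitch_deg: float,
                     threshold_deg: float = 10.0) -> str:
    """Map gaze angles to a coarse direction phrase.

    Hypothetical convention: positive yaw = right, positive pitch = up.
    Angles within +/- threshold_deg of zero count as centred on that axis.
    """
    if yaw_deg < -threshold_deg:
        horizontal = "left"
    elif yaw_deg > threshold_deg:
        horizontal = "right"
    else:
        horizontal = ""

    if pitch_deg < -threshold_deg:
        vertical = "down"
    elif pitch_deg > threshold_deg:
        vertical = "up"
    else:
        vertical = ""

    # Keep only the axes that deviate from centre, vertical first.
    words = [w for w in (vertical, horizontal) if w]
    return " and ".join(words) if words else "straight ahead"


def gaze_prompt(yaw_deg: float, pitch_deg: float,
                threshold_deg: float = 10.0) -> str:
    """Build a text prompt carrying the coarse direction (assumed template)."""
    direction = coarse_direction(yaw_deg, pitch_deg, threshold_deg)
    return f"A photo of a face gazing {direction}."


if __name__ == "__main__":
    for yaw, pitch in [(-25.0, 5.0), (15.0, 20.0), (0.0, 0.0)]:
        print(yaw, pitch, "->", gaze_prompt(yaw, pitch))
```

A prompt produced this way could then be passed to a CLIP text encoder and paired with the face image features, which is the kind of text-eye pair the abstract refers to. The coarse binning is deliberate in this sketch: the text only needs to supply a rough directional prior, while the fine-grained angle would still come from the visual branch and the fusion module.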