Speaker extraction seeks to extract the target speech in a multi-talker
scenario given an auxiliary reference. Such a reference can be auditory, i.e.,
pre-recorded speech; visual, i.e., lip movements; or contextual, i.e., a phonetic
sequence. References in different modalities provide di