Previous studies have confirmed the effectiveness of incorporating visual
information into speech enhancement (SE) systems. Although visual input improves
denoising performance, two problems may arise when implementing an audio-visual
SE (AVSE) system: (1) additional processing costs are incurred to incorporate
visual input and (2) the use of face or lip images m