The goal of a speech-to-image transform is to produce a photo-realistic
picture directly from a speech signal. Several recent studies have addressed
this task and reported promising performance. However, current
speech-to-image approaches are based on a stacked modular framewo