Capturing high-level structure in audio waveforms is challenging because a
single second of audio spans tens of thousands of timesteps. While long-range
dependencies are difficult to model directly in the time domain, we show that
they can be more tractably modelled in two-dimensional time-frequency
representations such as spectrograms. By leveraging this re