Learning-based text-to-speech systems have the potential to generalize from
one speaker to the next and thus require only a relatively short sample of any new
voice. However, this promise is currently largely unrealized. We present a
method that is designed to capture a new speaker from a s