Speech-to-speech translation is a typical sequence-to-sequence learning task that naturally has two directions. How can we effectively leverage bidirectional supervision signals to produce high-fidelity audio in both directions?
Existing approaches either train two separate models or a mu