In this work, we propose to model the interaction between visual and textual
features for multi-modal neural machine translation (MMT) through a latent
variable model. This latent variable can be seen as a multi-modal stochastic
embedding of an image and its description in a foreign la