Past works on multimodal machine translation (MMT) elevate bilingual setup by
incorporating additional aligned vision information. However, an image-must
requirement of the multimodal dataset largely hinders MMT's development --
namely that it demands an aligned form of [image, source