Transcription or subtitling of open-domain videos remains a challenging
task for automatic speech recognition (ASR) because of difficult
acoustics, variable signal processing, and the essentially unrestricted domain
of the data. In previous work, we have shown that the visu