The last years have shown rapid developments in the field of multimodal
machine learning, combining e.g., vision, text or speech. In this position
paper we explain how the field uses outdated definitions of multimodality that
prove unfit for the machine learning era. We propose a new t