Multimodal processing has attracted much attention lately, especially with the
success of pre-training. However, prior work has mainly focused on
vision-language pre-training, as introducing more modalities can greatly
complicate model design and optimization. In this paper, we ext