The visual and audio modalities of a video are symbiotic: they carry both
common and complementary information. If this information is mined
and fused sufficiently, the performance of related video tasks can be
significantly enhanced. However, due to the environmental interfe