Video multimodal fusion aims to integrate the multimodal signals in videos, such as visual, audio, and text, so that predictions can draw on the complementary content of multiple modalities. However, unlike image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both the visual and audio modalities. Prior denoising methods such as forget gates filter noise at a coarse granularity. They often suppress redundant and noisy information at the risk of losing critical information. Therefore, we propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism to filter out noise and redundancy with a restricted receptive field. On the other hand, we use a mutual information maximization module to regulate the filtering module so that it preserves key information within different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. These results indicate that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available at https://github.com/WSXRHFG/DBF.
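
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the released DBF implementation. It assumes that the bottleneck is a small set of shared tokens through which all cross-modal information must pass, and it swaps in an InfoNCE loss as one common way to maximize a lower bound on mutual information; the abstract does not say which estimator the model uses. All names here (`attend`, `bottleneck_fusion_step`, `info_nce`) are hypothetical, and the attention omits learned projections for brevity.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (a simplification):
    each query becomes a softmax-weighted average of keys_values."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(len(q))])
    return out

def bottleneck_fusion_step(modalities, bottleneck):
    """One fusion step (assumed design). `modalities` maps a name to a list of
    token vectors; `bottleneck` is a short list of token vectors."""
    # Compress: the few bottleneck tokens read from every modality, so only
    # a limited amount of information can be carried forward.
    all_tokens = [tok for toks in modalities.values() for tok in toks]
    new_bottleneck = attend(bottleneck, all_tokens)
    # Broadcast: each modality reads from its own tokens plus the bottleneck,
    # so cross-modal information must pass through the bottleneck.
    new_modalities = {name: attend(toks, toks + new_bottleneck)
                      for name, toks in modalities.items()}
    return new_modalities, new_bottleneck

def info_nce(fused, unimodal):
    """InfoNCE loss over N paired vectors. Minimizing it maximizes the
    lower bound log(N) - loss on the mutual information of the pairs."""
    losses = []
    for i, f in enumerate(fused):
        probs = softmax([dot(f, u) for u in unimodal])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)

# Usage with random stand-in features (not real video data).
random.seed(0)
dim = 4

def rand_tokens(n):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

modalities = {"visual": rand_tokens(6), "audio": rand_tokens(5), "text": rand_tokens(3)}
bottleneck = rand_tokens(2)
modalities, bottleneck = bottleneck_fusion_step(modalities, bottleneck)

# Matched pairs should score a lower loss than mismatched pairs.
aligned = info_nce([[10.0, 0.0], [0.0, 10.0]], [[10.0, 0.0], [0.0, 10.0]])
shuffled = info_nce([[10.0, 0.0], [0.0, 10.0]], [[0.0, 10.0], [10.0, 0.0]])
```

In this sketch the sequence lengths of the modalities are unchanged by a fusion step, while the bottleneck stays at two tokens. That fixed, small size is what restricts how much redundant content can cross between modalities, and a mutual information term such as `info_nce` would then be added to the training loss to discourage the bottleneck from discarding information that identifies each modality's content.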

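This paper proposes a fine-grained denoising model for video multimodal fusion (DBF). It uses a bottleneck mechanism to filter out noise and redundant information, and a mutual information maximization module to regulate the filter so that key information in each modality is preserved. Experiments show that the DBF model achieves significant improvements on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization, demonstrating that the model can effectively capture salient features from noisy and redundant video, audio, and text inputs.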

Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion