It is challenging to perform question-answering over complex, multimodal content such as television clips. This is in part because current video-language models rely on single-modality reasoning, perform worse on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Our method's experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.

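This paper proposes TV-TREES, a multimodal entailment tree generator, to address question-answering over complex multimodal content such as television clips. By generating trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions, it enables interpretable joint-modality reasoning. Experiments on the TVQA dataset demonstrate state-of-the-art zero-shot performance on full video clips, combining interpretability and performance in contrast to black-box methods.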

TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning