Explainable artificial intelligence techniques are becoming increasingly important as deep learning is applied across more domains. These techniques aim to provide a better understanding of complex "black box" models and to enhance user trust while maintaining high learning performance. Many studies have focused on explaining deep learning models for image input in computer vision, but video explanations remain relatively unexplored because of the added complexity of the temporal dimension. In this paper, we present a unified framework for local agnostic explanations in the video domain. Our contributions are: (1) extending a fine-grained explanation framework tailored for computer vision data, (2) adapting six existing explanation techniques to video data by incorporating temporal information and enabling local explanations, and (3) evaluating and comparing the adapted explanation methods using different models and datasets. We discuss the possibilities and choices involved in the removal-based explanation process for visual data. We then describe how the six explanation methods are adapted to video and compare them with existing approaches. We evaluate the methods using automated metrics and a user-based evaluation, showing that 3D RISE, 3D LIME, and 3D Kernel SHAP outperform the other methods. By decomposing the explanation process into manageable steps, we facilitate the study of each choice's impact and allow explanation methods to be further refined to suit specific datasets and models.

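This paper proposes a unified framework for the video domain. While maintaining high learning performance, it extends a fine-grained explanation framework for computer vision data by incorporating temporal information and enabling local explanations, applies six existing explanation techniques to video data, and evaluates and compares them. The results show that 3D RISE, 3D LIME, and 3D Kernel SHAP outperform the other methods. By decomposing the explanation process into manageable steps, we make it easier to study the impact of each choice and to further refine explanation methods to suit specific datasets and models.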
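
To make the idea of a stepwise removal-based explanation concrete, the following is a minimal sketch of a RISE-style procedure applied to a video volume: sample coarse masks over time, height, and width; remove the masked-out regions; query the black-box model; and accumulate the masks weighted by the model's score. It is an illustration under stated assumptions, not the implementation evaluated in this paper. The function name `rise_3d_saliency`, its parameters, the zero-valued replacement, the nearest-neighbour upsampling, and the toy scoring function are all hypothetical choices made for this example.

```python
import numpy as np


def rise_3d_saliency(video, score_fn, n_masks=200, grid=(2, 4, 4),
                     keep_prob=0.5, seed=0):
    """RISE-style saliency for a video of shape (T, H, W) or (T, H, W, C).

    Hypothetical sketch: `score_fn` is any black-box callable that maps a
    perturbed video to a scalar score for the class being explained.
    """
    rng = np.random.default_rng(seed)
    t, h, w = video.shape[:3]
    # Upsampling factor per axis (ceil division of video size by grid size).
    reps = [-(-size // g) for size, g in zip((t, h, w), grid)]
    saliency = np.zeros((t, h, w), dtype=float)

    for _ in range(n_masks):
        # Step 1: sample a coarse binary mask over (time, height, width) cells.
        coarse = (rng.random(grid) < keep_prob).astype(float)

        # Step 2: nearest-neighbour upsampling to the video resolution.
        # (Assumption: smoother interpolation and random shifts are omitted.)
        mask = coarse
        for axis, r in enumerate(reps):
            mask = np.repeat(mask, r, axis=axis)
        mask = mask[:t, :h, :w]

        # Step 3: remove masked-out voxels by replacing them with zeros.
        extra_dims = (1,) * (video.ndim - 3)
        perturbed = video * mask.reshape(mask.shape + extra_dims)

        # Step 4: query the model and weight the mask by the returned score.
        saliency += score_fn(perturbed) * mask

    # Step 5: normalise by the expected number of times a voxel is kept.
    return saliency / (n_masks * keep_prob)


def toy_score(clip):
    # Toy "model": only the second half of the clip, top-left corner, matters.
    return float(clip[4:, :8, :8].mean())


video = np.ones((8, 16, 16))
saliency = rise_3d_saliency(video, toy_score, n_masks=200, grid=(2, 2, 2))
```

In this toy setting the region that the scoring function depends on should receive a higher saliency than the rest of the volume. Each numbered step corresponds to a separate choice (mask shape, replacement value, aggregation rule) that could be varied independently.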

Local Agnostic Video Explanations: A Study on the Applicability of Removal-Based Explanations to Video