We propose a novel approach for automatic chaptering of TV newscast videos, addressing the challenge of structuring and organizing large collections of unsegmented broadcast content. Our method integrates audio and visual cues through a two-stage process: frozen neural networks first extract features from each modality separately, and a trained LSTM network then fuses these features to predict segment boundaries. We evaluate the model on a dataset of more than 500 TV newscast videos, averaging 41 minutes each and covering varied lengths and topics, collected from TF1, a French TV channel. Experimental results show that this fusion strategy achieves state-of-the-art performance, with a precision of 82% at an IoU of 90%. The approach thereby improves analysis, indexing, and storage capabilities for TV newscast archives, paving the way towards efficient management and use of large audiovisual collections.
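To make the reported metric concrete, the following minimal sketch computes a precision at a temporal IoU threshold for predicted chapters. It rests on an assumption that the abstract does not confirm: a predicted segment counts as correct when its IoU with at least one reference segment reaches the threshold. The names `temporal_iou` and `precision_at_iou`, and the example timestamps, are hypothetical and purely illustrative.

```python
from typing import List, Tuple

# A chapter is represented as (start_seconds, end_seconds).
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def precision_at_iou(
    predicted: List[Segment],
    reference: List[Segment],
    threshold: float = 0.9,
) -> float:
    """Fraction of predicted segments that match some reference segment.

    Assumed matching rule (not taken from the paper): a prediction is a hit
    if its IoU with any reference segment is at least `threshold`.
    """
    if not predicted:
        return 0.0
    hits = sum(
        1
        for p in predicted
        if any(temporal_iou(p, r) >= threshold for r in reference)
    )
    return hits / len(predicted)


# Illustrative data only: three reference chapters, four predicted chapters.
reference = [(0.0, 100.0), (100.0, 250.0), (250.0, 400.0)]
predicted = [(0.0, 98.0), (98.0, 250.0), (250.0, 330.0), (330.0, 400.0)]

score = precision_at_iou(predicted, reference, threshold=0.9)
print(score)  # 0.5: the first two predictions match, the last two do not
```

A threshold of 0.9 is strict: in the example, splitting the last reference chapter in two makes both halves miss, even though their boundaries are otherwise exact.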

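We propose a new method for automatically chaptering TV newscast videos. It integrates audio and visual cues through frozen neural networks and a trained LSTM network to generate accurate segment boundaries. Evaluated on a diverse dataset of more than 500 TV newscast videos, this fusion strategy achieves state-of-the-art performance, with a precision of 82% under an IoU criterion. The method therefore improves analysis, indexing, and storage capabilities for TV newscast archives, paving the way for efficient management and use of large-scale audiovisual resources.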

Multimodal Chaptering of Long-Form TV Newscast Videos