This work evaluates a large video understanding foundation model on the downstream task of human fall detection in untrimmed video. A pretrained vision transformer is used for multi-class action detection with the classes "Fall", "Lying", and "Other/Activities of daily living (ADL)". A method for temporal action localization that relies on a simple cutup of untrimmed videos is demonstrated. The methodology includes a preprocessing pipeline that converts datasets with timestamp action annotations into labeled datasets of short action clips, and simple, effective clip-sampling strategies are introduced. The proposed method is evaluated empirically on the publicly available High-Quality Fall Simulation Dataset (HQFSD). The results are promising for real-time application: falls are detected at the video level with a state-of-the-art F1 score of 0.96 on HQFSD under the given experimental settings. The source code will be made available on GitHub.
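The preprocessing step described above, which turns timestamp annotations into labeled short clips, can be illustrated with a minimal sketch. This is not the authors' implementation: the clip length, stride, overlap threshold, the `Annotation` type, the `cut_and_label` function, and the shorthand label `"Other/ADL"` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One timestamp annotation on an untrimmed video (hypothetical type)."""
    start: float  # seconds
    end: float    # seconds
    label: str    # e.g. "Fall" or "Lying"


def cut_and_label(duration, annotations, clip_len=2.0, stride=2.0,
                  min_overlap=0.5, background="Other/ADL"):
    """Cut a video timeline into fixed-length clips and label each clip.

    A clip takes the label of the annotation that covers the largest
    fraction of it, provided that fraction is at least `min_overlap`;
    otherwise it gets the background label. All defaults are assumed
    values, not settings taken from the paper.
    """
    if duration < clip_len:
        return []
    # Index-based clip starts avoid accumulating floating-point error.
    n_clips = int((duration - clip_len) / stride) + 1
    clips = []
    for i in range(n_clips):
        start = i * stride
        end = start + clip_len
        best_label, best_frac = background, 0.0
        for ann in annotations:
            overlap = max(0.0, min(end, ann.end) - max(start, ann.start))
            frac = overlap / clip_len
            # Strict ">" keeps the earlier annotation when fractions tie.
            if frac >= min_overlap and frac > best_frac:
                best_label, best_frac = ann.label, frac
        clips.append((start, end, best_label))
    return clips


# Example: a 10-second video with one fall followed by lying.
example_annotations = [
    Annotation(start=3.0, end=5.0, label="Fall"),
    Annotation(start=5.0, end=9.0, label="Lying"),
]
example_clips = cut_and_label(10.0, example_annotations)
```

With non-overlapping two-second clips, each clip maps to exactly one class, so the resulting list can be used directly as a labeled clip dataset; a smaller stride would produce overlapping clips and more training samples at the cost of redundancy.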

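Based on a large video understanding model, this study examines the performance of human fall detection in untrimmed video and uses a pretrained vision transformer for multi-class action detection with the classes "Fall", "Lying", and "Other/Activities of daily living". A temporal action localization method based on a simple cutup of untrimmed videos is presented, along with simple and effective clip-sampling strategies. The experimental results validate the method and show that, under the given experimental settings, fall events are detected with an F1 score of 0.96, which is promising for real-time application. The source code will be made available on GitHub.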

Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model