Despite recent advances in video-based action recognition and robust spatio-temporal modeling, most proposed approaches rely on abundant computational resources to run large, computation-intensive convolutional or transformer-based neural networks and obtain satisfactory results. This limits the deployment of such models on edge devices with limited power and computing resources. In this work, we investigate an important smart home application, video-based delivery detection, and present a simple and lightweight pipeline for this task that can run on resource-constrained doorbell cameras. Our proposed pipeline relies on motion cues to generate a set of coarse activity proposals, which are then classified by a mobile-friendly 3D CNN. For training, we design a novel semi-supervised attention module that helps the network learn robust spatio-temporal features, and we adopt an evidence-based optimization objective that allows the uncertainty of the network's predictions to be quantified. Experimental results on our curated delivery dataset show that our pipeline is significantly more effective than the alternatives and highlight how our training-phase novelties yield considerable performance gains at no additional inference-time cost.
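To make the first stage concrete, the following is a minimal sketch of how motion cues could produce coarse activity proposals. The abstract does not specify the motion detector, so simple frame differencing is assumed here; the function name `motion_proposals` and the parameters `diff_threshold` and `min_length` are hypothetical and chosen only for illustration.

```python
import numpy as np


def motion_proposals(frames, diff_threshold=15.0, min_length=3):
    """Return coarse (start, end) frame ranges, end exclusive, that contain motion.

    Assumption: motion is measured by frame differencing. This is an
    illustrative stand-in, not the detector used by the authors.
    frames: array of shape (T, H, W) holding grayscale frames.
    """
    frames = np.asarray(frames, dtype=np.float32)
    # One score per transition between consecutive frames:
    # the mean absolute pixel difference.
    scores = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    active = scores > diff_threshold

    proposals = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i  # a run of moving transitions begins
        elif not is_active and start is not None:
            if i - start >= min_length:
                # Transitions start..i-1 span frames start..i.
                proposals.append((start, i + 1))
            start = None
    # Close a run that lasts until the end of the clip.
    if start is not None and len(active) - start >= min_length:
        proposals.append((start, len(active) + 1))
    return proposals
```

Each returned range would then be cropped from the video and passed to the classifier, so the heavier network runs only on segments where something moved.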

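This paper proposes a simple and lightweight video-based delivery detection pipeline for mobile devices. The pipeline relies on motion cues to generate a set of coarse activity proposals, which are then classified by a mobile-friendly 3D CNN. The network is trained with a semi-supervised attention module and an evidence-based optimization objective to obtain robust spatio-temporal features. Experimental results show that the method has a significant performance advantage over existing methods and is suitable for edge devices such as resource-constrained doorbell cameras.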
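As an illustration of how an evidence-based objective can expose prediction uncertainty, the sketch below assumes the commonly used Dirichlet-based evidential formulation, in which non-negative network outputs are treated as per-class evidence. The text here does not state which formulation the authors use, so this is an assumption, and the function name `evidential_prediction` is hypothetical.

```python
import numpy as np


def evidential_prediction(logits):
    """Turn raw network outputs into class probabilities and an uncertainty score.

    Assumption: Dirichlet-based evidential formulation (evidence = ReLU(logits),
    alpha = evidence + 1). This may differ from the authors' exact objective.
    """
    logits = np.asarray(logits, dtype=np.float64)
    evidence = np.maximum(logits, 0.0)   # non-negative evidence per class
    alpha = evidence + 1.0               # Dirichlet concentration parameters
    strength = alpha.sum()               # total Dirichlet strength
    probs = alpha / strength             # expected class probabilities
    uncertainty = len(alpha) / strength  # equals 1.0 when there is no evidence
    return probs, uncertainty
```

Under this formulation, a proposal with little evidence for any class receives an uncertainty close to 1, so a doorbell application could, for example, suppress notifications for such proposals.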

Lightweight Delivery Detection on Doorbell Cameras