This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples. Where existing work relies on transferring global attribute or object information from seen to unseen action videos, we seek to classify and spatio-temporally localize unseen actions in videos from image-based object information only. We propose three spatial object priors, which encode local person and object detectors along with their spatial relations. On top of these, we introduce three semantic object priors, which extend semantic matching through word embeddings with three simple functions that tackle semantic ambiguity, object discrimination, and object naming. A video embedding combines the spatial and semantic object priors. This embedding enables us to introduce a new video retrieval task that retrieves action tubes in video collections based on user-specified objects, spatial relations, and object size. Experimental evaluation on five action datasets shows the importance of spatial and semantic object priors for unseen actions. We find that persons and objects have preferred spatial relations that benefit unseen action localization, while using multiple languages and simple object filtering directly improves semantic matching, leading to state-of-the-art results for both unseen action classification and localization.

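This study proposes a method for classifying and spatio-temporally localizing human actions in videos without any labeled video training examples. The method classifies and localizes actions from object information. It introduces three spatial object priors and three semantic object priors, and combines them into a video embedding that supports a new video retrieval task: retrieving actions in videos based on user-specified objects, spatial relations, and object size. Experiments show that spatial and semantic object priors are highly beneficial for localizing unseen actions, while using multiple languages and simple object filtering directly improves semantic matching, yielding state-of-the-art results for both unseen action classification and localization.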
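To make the idea of semantic matching with simple object filtering concrete, the following is a minimal sketch, not the authors' implementation. It assumes each action name and object name already has a word embedding, scores an unseen action as the similarity-weighted sum of object likelihoods in a video, and keeps only the `top_k` most similar objects per action. The function names, the toy three-dimensional embeddings, and the object likelihoods are all hypothetical and chosen for illustration only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def score_unseen_actions(object_scores, object_embeddings,
                         action_embeddings, top_k=2):
    """Score actions for one video without any action training examples.

    object_scores:     (num_objects,) object likelihoods for the video
    object_embeddings: (num_objects, dim) word embeddings of object names
    action_embeddings: (num_actions, dim) word embeddings of action names
    Returns an array of shape (num_actions,).
    """
    sim = cosine_similarity(action_embeddings, object_embeddings)

    # Simple object filtering (illustrative): for every action, keep only
    # the top_k most similar objects and zero out the rest.
    top = np.argsort(-sim, axis=1)[:, :top_k]
    rows = np.arange(sim.shape[0])[:, None]
    filtered = np.zeros_like(sim)
    filtered[rows, top] = sim[rows, top]

    # Action score = similarity-weighted sum of object likelihoods.
    return filtered @ object_scores


# Hypothetical toy data: objects "horse", "ball", "saddle";
# actions "riding horse", "kicking ball".
object_embeddings = np.array([[1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [0.9, 0.1, 0.0]])
action_embeddings = np.array([[1.0, 0.1, 0.0],
                              [0.0, 1.0, 0.1]])
object_scores = np.array([0.8, 0.1, 0.6])  # made-up likelihoods for one video

scores = score_unseen_actions(object_scores, object_embeddings,
                              action_embeddings, top_k=2)
predicted_action = int(np.argmax(scores))
```

In this toy video the horse and saddle likelihoods dominate, so the first action ("riding horse") receives the higher score. The spatial priors described in the abstract would additionally account for where objects appear relative to the detected person, which this sketch leaves out.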

Object Priors for Recognizing and Localizing Unseen Actions