Many first-person vision tasks, such as activity recognition or video summarization, require knowing which objects the camera wearer is interacting with (i.e., action-objects). The standard way to obtain this information is via manual annotation, which is costly and time-consuming. Also, whereas for third-person tasks such as object detection, the annot