Given video data from multiple personal devices or street cameras, can we
exploit the structural and dynamic information to learn dynamic representations
of objects for applications such as distributed surveillance, without storing
data at a central server, which would lead to a violation of <