Abstract

We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep