Our objective in this work is fine-grained classification of actions in
untrimmed videos, where the actions may be temporally extended or may span only
a few frames of the video. We cast this as a query-response mechanism, where
each query addresses a particular question and has its own response label set.
We make the following four contributions: (I) We