This paper addresses labeling video frames with action classes under weak
supervision, where training videos are annotated with the temporal ordering
of their actions but the start and end frames of those actions are unknown.
Following prior work, we use an HMM grounded on a Gated Recurrent Uni