Recognising actions in videos relies on labelled supervision during training,
typically the start and end times of each action instance. This supervision is
not only subjective, but also expensive to acquire. Weak video-level
supervision has been successfully exploited for recognition in untrimmed
videos; however, it is challenged when the number of different