This paper addresses weakly supervised action segmentation, where the ground
truth specifies only the set of actions present in a training video, but not
their true temporal ordering. Prior work typically uses a classifier that
independently labels video frames for generating the pseudo g