deep neural networks deliver state-of-the-art visual recognition, but they
rely on large datasets, which are time-consuming to annotate. These datasets
are typically annotated in two stages: (1) determining the presence of object
classes at the image level and (2) marking the spatial e