It is always well believed that parsing an image into constituent visual
patterns would be helpful for understanding and representing an image.
Nevertheless, there has not been evidence in support of the idea on describing
an image with a natural-language utterance. In this paper, we introduce a new
design to model a hierarchy from instance level (segmentati