Humans describe images in terms of nouns and adjectives, while algorithms operate on images represented as sets of pixels. Bridging this gap between how we would like to access images and how they are typically represented is the goal of image parsing. In this paper we propose treating nouns as object labels and adjectives as visual attributes. This allows us to formulate the image parsing problem as one of jointly estimating per-pixel object and attribute labels from a set of training images. We propose an efficient (interactive-time) solution to this problem. Using the extracted attribute labels as handles, our system lets a user verbally refine the results. This enables hands-free parsing of an image into pixel-wise object/attribute labels that correspond to human semantics. Verbally selecting objects of interest enables a novel and natural interaction modality that could be used to interact with new-generation devices (e.g., smartphones and Google Glass). We demonstrate our system on a large number of real-world images of varying complexity, and we assess its trade-offs relative to traditional mouse-based interaction through both a user study and a large-scale quantitative evaluation.

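By treating nouns as object labels and adjectives as visual attribute labels, we propose an efficient solution that addresses image parsing by jointly estimating per-pixel object and attribute labels, and we demonstrate that it can be used to interact with new-generation devices.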

ImageSpirit: Verbal Guided Image Parsing