We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs. Our dataset is collected on top of the Visual Genome dataset and uses the existing annotations to generate a challenging set of referring phrases for which the corresponding regions are manually annotated. Phrases in our dataset correspond to multiple regions and describe a large number of object and stuff categories as well as their attributes such as color, shape, parts, and relationships with other entities in the image. Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to existing state-of-the-art methods. We systematically handle the long-tail nature of these concepts and present a modular approach that combines category, attribute, and relationship cues and outperforms existing approaches.

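Building on an extension of the Visual Genome dataset, we segment image regions given natural language phrases. The phrases cover a large number of object and stuff categories together with descriptions of their attributes, including color, shape, parts, and relationships with other entities in the image. We propose a modular approach that combines category, attribute, and relationship cues to improve on current image segmentation methods.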

PhraseCut: Language-based Image Segmentation in the Wild