We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers 3.5 million common and rare real-world concepts and contains 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset will be released at https://github.com/OpenGVLab/All-Seeing, and a demo is available at https://huggingface.co/spaces/OpenGVLab/all-seeing.

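Summary: The All-Seeing (AS) project is a large-scale dataset and model for recognizing and understanding everything in the open world. Using a scalable data engine that combines human feedback with efficient models, the new AS-1B dataset annotates over 1 billion regions with semantic tags, question-answering pairs, and detailed captions. It covers 3.5 million common and rare real-world concepts and provides 132.2 billion tokens describing these concepts and their attributes. Building on this dataset, the All-Seeing model (ASM) is a unified framework for panoptic visual recognition and understanding. It is trained with open-ended language prompts and locations and shows remarkable zero-shot performance on region-text retrieval, region recognition, captioning, and question-answering. The authors hope the project can serve as a foundation for vision-language artificial general intelligence research. The models and dataset will be released at the project repository, and a demo is available online.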

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World