Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary,
CLIP, which covers most common vision tasks. However, for some special
vision tasks that need dense and fine-grained vision perception, e.g.,
document-level OCR or chart understanding, especially in non-English scenarios,
the CLIP-style vocabulary may be inefficient at tokenizing the vision
knowledge and may even suffer from an out-of-vocabulary problem. Accordingly, we propose
Vary, an efficient and effective method to scale up the vision vocabulary of
LVLMs. The Vary procedure naturally divides into two phases: the
generation and the integration of a new vision vocabulary. In the first phase, we
devise a vocabulary network along with a tiny decoder-only transformer to
produce the desired vocabulary via autoregression. In the second, we scale up the
vanilla vision vocabulary by merging the new one with the original one (CLIP),
enabling the LVLMs to quickly garner new features. Compared to the popular
BLIP-2, MiniGPT4, and LLaVA, Vary can maintain its vanilla capabilities while
offering stronger fine-grained perception and understanding.
Specifically, Vary handles new document parsing features (OCR or
markdown conversion) while achieving 78.2% ANLS on DocVQA and 36.2% on MMVet.
Our code will be publicly available on the homepage.
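
For readers unfamiliar with the ANLS figure quoted above, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly computed for DocVQA-style benchmarks. It is not taken from the Vary code base. The function names are illustrative, and the lowercasing and whitespace stripping are an assumed normalization that individual evaluation scripts may handle differently.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        previous = current
    return previous[-1]


def anls(predictions: List[str], references: List[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best similarity against any reference.

    A similarity counts only if the normalized distance is below `threshold`;
    otherwise the answer scores 0 for that reference.
    """
    scores = []
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            # Assumed normalization: case-insensitive, surrounding spaces ignored.
            p, g = pred.strip().lower(), gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two questions: one near-miss answer, one exact match.
    print(anls(["helo", "Paris"], [["hello"], ["paris", "Paris, France"]]))
```

The threshold means that an answer sharing less than half of its characters with every reference contributes nothing, so the metric tolerates small OCR slips without rewarding unrelated strings.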

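By proposing a new method named Vary, the vision vocabulary of modern Large Vision-Language Models (LVLMs) can be scaled up, yielding stronger fine-grained perception and understanding. The method is especially suited to special vision tasks that need dense and fine-grained visual perception, such as document-level OCR or chart understanding, where the conventional vision vocabulary may be inefficient or may have incomplete coverage in non-English scenarios.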

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
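
The Vary abstract says the new vocabulary is merged with the original CLIP one but does not say how. The sketch below shows one plausible reading, stated here purely as an assumption: each branch emits the same number of image tokens, and the merged vocabulary concatenates the two feature vectors of each token along the channel dimension. `merge_vocabularies` and the toy feature values are hypothetical and are not the authors' implementation.

```python
from typing import List

Vector = List[float]


def merge_vocabularies(clip_tokens: List[Vector],
                       new_tokens: List[Vector]) -> List[Vector]:
    """Concatenate per-token features from two vision branches.

    Assumption: both branches produce one feature vector per image token,
    in the same token order, so token i of one branch lines up with token i
    of the other.
    """
    if len(clip_tokens) != len(new_tokens):
        raise ValueError("both branches must emit the same number of tokens")
    return [clip_vec + new_vec for clip_vec, new_vec in zip(clip_tokens, new_tokens)]


if __name__ == "__main__":
    clip_features = [[1.0, 2.0], [3.0, 4.0]]  # 2 tokens, 2 channels (toy values)
    new_features = [[0.5], [0.25]]            # 2 tokens, 1 channel (toy values)
    merged = merge_vocabularies(clip_features, new_features)
    print(len(merged), len(merged[0]))        # token count, merged channel width
```

Under this reading the token count stays fixed while the channel width grows, so the language model sees the same sequence length as before.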

In this report, we introduce DocXChain, a powerful open-source toolchain for
document parsing, which is designed and developed to automatically convert the
rich information embodied in unstructured documents, such as text, tables and
charts, into structured representations that are readable and manipulable by
machines. Specifically, basic capabilities, including text detection, text
recognition, table structure recognition and layout analysis, are provided.
Upon these basic capabilities, we also build a set of fully functional
pipelines for document parsing, i.e., general text reading, table parsing, and
document structurization, to drive various applications related to documents in
real-world scenarios. Moreover, DocXChain is concise, modularized and flexible,
such that it can be readily integrated with existing tools, libraries or models
(such as LangChain and ChatGPT), to construct more powerful systems that can
accomplish more complicated and challenging tasks. The code of DocXChain is
publicly available at:
https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain
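
To make the idea of building pipelines on top of basic capabilities concrete, here is a minimal sketch of a "general text reading" pipeline composed from a text detector and a text recognizer. Every name in it (`TextLine`, `general_text_reading`, the stub detector and recognizer) is hypothetical and chosen for illustration; none of it is DocXChain's actual API, and the sorting by top-left corner is an assumed, simplistic reading order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), assumed convention


@dataclass
class TextLine:
    box: Box
    text: str


def general_text_reading(image: object,
                         detect: Callable[[object], List[Box]],
                         recognize: Callable[[object, Box], str]) -> List[TextLine]:
    """Run detection, then recognition per box, then sort top-to-bottom, left-to-right."""
    lines = [TextLine(box, recognize(image, box)) for box in detect(image)]
    return sorted(lines, key=lambda line: (line.box[1], line.box[0]))


# Stub modules standing in for real models: the "image" is just a dict
# mapping each text box to the string printed inside it.
def stub_detect(image: Dict[Box, str]) -> List[Box]:
    return list(image.keys())


def stub_recognize(image: Dict[Box, str], box: Box) -> str:
    return image[box]


if __name__ == "__main__":
    page = {(10, 50, 80, 12): "second line", (10, 20, 80, 12): "first line"}
    for line in general_text_reading(page, stub_detect, stub_recognize):
        print(line.box, line.text)
```

Passing the detector and recognizer in as plain callables is one way to get the modularity the abstract describes: either module can be swapped without touching the pipeline function.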

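DocXChain is a powerful open-source toolchain that automatically converts the rich information in unstructured documents (such as text, tables, and charts) into structured representations that can be read and manipulated. It provides basic capabilities such as text detection, text recognition, table structure recognition, and layout analysis, and it can be readily integrated with existing tools, libraries, or models to build more powerful systems for more complicated and challenging tasks.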

DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond

Information in industry, research, and the public sector is widely stored as
rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks,
systems are needed that map rendered documents onto a structured hierarchical
format. However, existing systems for this task are limited by heuristics and
are not end-to-end trainable. In this work, we introduce the Document Structure
Generator (DSG), a novel system for document parsing that is fully end-to-end
trainable. DSG uses a deep neural network to parse (i) entities in
documents (e.g., figures, text blocks, headers) and (ii) relations that
capture the sequence and nested structure between entities. Unlike existing
systems that rely on heuristics, our DSG is trained end-to-end, making it
effective and flexible for real-world applications. We further contribute a
new, large-scale dataset called E-Periodica comprising real-world magazines
with complex document structures for evaluation. Our results demonstrate that
our DSG outperforms commercial OCR tools and, on top of that, achieves
state-of-the-art performance. To the best of our knowledge, our DSG system is
the first end-to-end trainable system for hierarchical document parsing.
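
The output format described above, entities plus relations for sequence and nesting, can be illustrated with a small sketch that assembles a hierarchy from such predictions. This is only a post-processing illustration under assumed input shapes, not the DSG model or its data format: `Entity`, `build_hierarchy`, and `outline` are hypothetical names, nesting is assumed to arrive as (parent, child) id pairs, and sequence as a list of ids in reading order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entity:
    id: int
    kind: str
    children: List["Entity"] = field(default_factory=list)


def build_hierarchy(entities: List[Tuple[int, str]],
                    parent_of: List[Tuple[int, int]],
                    order: List[int]) -> List[Entity]:
    """Return top-level entities, with children attached and sorted by reading order."""
    nodes = {eid: Entity(eid, kind) for eid, kind in entities}
    rank = {eid: position for position, eid in enumerate(order)}
    has_parent = set()
    for parent_id, child_id in parent_of:
        nodes[parent_id].children.append(nodes[child_id])
        has_parent.add(child_id)
    for node in nodes.values():
        node.children.sort(key=lambda child: rank[child.id])
    roots = [node for node in nodes.values() if node.id not in has_parent]
    return sorted(roots, key=lambda node: rank[node.id])


def outline(roots: List[Entity], depth: int = 0) -> List[str]:
    """Flatten the hierarchy into indented lines for inspection."""
    lines = []
    for node in roots:
        lines.append("  " * depth + f"{node.kind}#{node.id}")
        lines.extend(outline(node.children, depth + 1))
    return lines


if __name__ == "__main__":
    entities = [(1, "header"), (2, "text block"), (3, "figure"), (4, "header")]
    nesting = [(1, 3), (1, 2)]   # header 1 contains the figure and the text block
    reading_order = [1, 2, 3, 4]
    print("\n".join(outline(build_hierarchy(entities, nesting, reading_order))))
```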

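In this work, we introduce DSG, a document parsing system that is fully end-to-end trainable and maps rendered documents onto a structured hierarchical format. Training DSG end-to-end makes it effective and flexible for real-world applications, and our evaluation shows that DSG outperforms commercial OCR tools and achieves state-of-the-art performance. To the best of our knowledge, DSG is the first end-to-end trainable system for hierarchical document parsing.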