Charts provide visual representations of data and are widely used for
analyzing information, addressing queries, and conveying insights to others.
Various chart-related downstream tasks have emerged recently, such as
question-answering and summarization. A common strategy to solve these tasks is
to fine-tune various models originally trained on vision-language tasks.
However, such task-specific models are not capable of solving a wide range of
chart-related tasks, constraining their real-world applicability. To overcome
these challenges, we introduce ChartInstruct: a novel chart-specific
vision-language instruction-following dataset comprising 191K instructions
generated with 71K charts. We then present two distinct systems for instruction
tuning on such datasets: (1) an end-to-end model that connects a vision encoder
for chart understanding with an LLM; and (2) a pipeline model that employs a
two-step approach to extract chart data tables and input them into the LLM. In
experiments on four downstream tasks, we first show the effectiveness of our
model, which achieves a new set of state-of-the-art results. Further evaluation
shows that our instruction-tuning approach supports a wide array of real-world
chart comprehension and reasoning scenarios, thereby expanding the scope and
applicability of our models to new kinds of tasks.
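
The second step of the pipeline system described above (feeding an extracted data table to the LLM) might look roughly like the following. This is a minimal sketch, not the authors' implementation: the function names `linearize_table` and `build_prompt`, the prompt wording, and the toy table are all assumptions made for illustration, and the table-extraction step itself is not shown.

```python
from typing import Sequence


def linearize_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Flatten a chart's data table into text, one line per table row."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


def build_prompt(instruction: str, table_text: str) -> str:
    """Combine the extracted table and the user instruction into one LLM prompt."""
    return f"Chart data table:\n{table_text}\n\nInstruction: {instruction}\nAnswer:"


# Step 1 (assumed, not shown): a chart-to-table extractor yields header and rows.
header = ["Year", "Sales"]
rows = [[2021, 120], [2022, 150]]

# Step 2: the linearized table plus the instruction becomes the LLM input.
prompt = build_prompt("Which year had the highest sales?", linearize_table(header, rows))
```

The resulting `prompt` string would then be passed to whichever instruction-tuned LLM is in use.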

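By introducing the ChartInstruct dataset and two distinct systems, we present an instruction-tuning approach for chart-related tasks that offers broad applicability and efficiency.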

ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning

Recent accelerations in multi-modal applications have been made possible with
the plethora of image and text data available online. However, the scarcity of
analogous data in the medical field, specifically in histopathology, has halted
comparable progress. To enable similar representation learning for
histopathology, we turn to YouTube, an untapped resource of videos, offering
$1,087$ hours of valuable educational histopathology videos from expert
clinicians. From YouTube, we curate Quilt: a large-scale vision-language
dataset consisting of $768,826$ image and text pairs. Quilt was automatically
curated using a mixture of models, including large language models, handcrafted
algorithms, human knowledge databases, and automatic speech recognition. In
comparison, the most comprehensive datasets curated for histopathology amass
only around $200$K samples. We combine Quilt with datasets from other sources,
including Twitter, research papers, and the internet in general, to create an
even larger dataset: Quilt-1M, with $1$M paired image-text samples, marking it
as the largest vision-language histopathology dataset to date. We demonstrate
the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model
outperforms state-of-the-art models on both zero-shot and linear probing tasks
for classifying new histopathology images across $13$ diverse patch-level
datasets of $8$ different sub-pathologies, as well as on cross-modal retrieval tasks.
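
Zero-shot classification with a CLIP-style model, as mentioned above, generally works by comparing an image embedding against text embeddings of the class names and picking the closest one. The sketch below illustrates only that comparison step. The three-dimensional vectors and the class names are made-up stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any released Quilt-1M code.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_classify(image_emb: Sequence[float],
                       class_text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the class whose text embedding is most similar to the image embedding."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))


# Toy embeddings (assumed); in practice these come from the image and text encoders.
class_text_embs = {"tumor": [0.9, 0.1, 0.0], "normal": [0.1, 0.9, 0.0]}
image_emb = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(image_emb, class_text_embs)
```

No labeled training images are needed for this step, which is what distinguishes it from the linear probing evaluation.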

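By collecting images and text from YouTube and several other sources, we build Quilt-1M, a large-scale vision-language dataset containing 1M paired image-text samples, the largest histopathology image-text dataset to date. A pre-trained CLIP model fine-tuned on it outperforms state-of-the-art models on 13 diverse patch-level datasets and on cross-modal retrieval tasks.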

Quilt-1M: One Million Image-Text Pairs for Histopathology

In this paper, we present GEM as a General Evaluation benchmark for
Multimodal tasks. Unlike existing datasets such as GLUE, SuperGLUE,
XGLUE and XTREME that mainly focus on natural language tasks, GEM is a
large-scale vision-language benchmark, which consists of GEM-I for
image-language tasks and GEM-V for video-language tasks. Compared with
existing multimodal datasets such as MSCOCO and Flickr30K for image-language
tasks and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the
largest vision-language dataset covering image-language tasks and
video-language tasks at the same time, but also labeled in multiple languages.
We also provide two baseline models for this benchmark. We will release the
dataset, code and baseline models, aiming to advance the development of
multilingual multimodal research.

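This paper introduces GEM, a new general evaluation benchmark for multimodal tasks. It is a large-scale vision-language benchmark consisting of GEM-I for image-language tasks and GEM-V for video-language tasks, and it is labeled in multiple languages. We also provide two baseline models for this benchmark, aiming to advance multilingual multimodal research.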