We introduce Arboretum, the largest publicly accessible dataset designed to
advance AI for biodiversity applications. This dataset, curated from the
iNaturalist community science platform and vetted by domain experts to ensure
accuracy, includes 134.6 million images, surpassing existing datasets in scale
by an order of magnitude. The dataset encompasses image-language paired data
for a diverse set of species from birds (Aves), spiders/ticks/mites
(Arachnida), insects (Insecta), plants (Plantae), fungus/mushrooms (Fungi),
snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource
for multimodal vision-language AI models for biodiversity assessment and
agriculture research. Each image is annotated with scientific names, taxonomic
details, and common names, enhancing the robustness of AI model training.
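
As a rough illustration of how such annotations could be turned into image-language pairs, the sketch below assembles a caption from a scientific name, a common name, and taxonomic ranks. The `Observation` record, the `build_caption` helper, and the caption template are all hypothetical; the abstract does not specify the caption format actually used.

```python
from dataclasses import dataclass

# Ranks in descending order; a record may carry only some of them.
TAXON_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


@dataclass
class Observation:
    """Hypothetical per-image annotation record (not the dataset's real schema)."""
    scientific_name: str
    common_name: str
    taxonomy: dict  # maps rank name -> taxon name


def build_caption(obs):
    """Join the available annotations into one text caption (assumed template)."""
    lineage = " ".join(obs.taxonomy[r] for r in TAXON_RANKS if r in obs.taxonomy)
    parts = [f"a photo of {obs.scientific_name}"]
    if obs.common_name:
        parts.append(f"commonly known as {obs.common_name}")
    if lineage:
        parts.append(f"taxonomy: {lineage}")
    return ", ".join(parts)
```

Missing fields are simply skipped, so a record with only a scientific name still yields a usable caption.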
We showcase the value of Arboretum by releasing a suite of CLIP models
trained using a subset of 40 million captioned images. We introduce several new
benchmarks for rigorous assessment and report zero-shot accuracy alongside
evaluations across life stages, rare species, confounding species, and various
levels of the taxonomic hierarchy.
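
To make zero-shot evaluation at different taxonomic levels concrete, here is a minimal sketch in plain Python. It assumes image and label embeddings already exist (in practice they would come from a CLIP-style model); the helper names and the lineage lookup table are illustrative assumptions, not the benchmark code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_predict(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))


def accuracy_at_rank(preds, truths, lineage, rank):
    """Count a prediction as correct if it matches the truth at the given rank.

    `lineage` maps each species label to a dict of rank -> taxon name.
    """
    hits = sum(lineage[p][rank] == lineage[t][rank] for p, t in zip(preds, truths))
    return hits / len(truths)
```

Scoring at a coarser rank is more forgiving: a prediction that confuses two species of the same genus is wrong at the species level but still correct at the genus level.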
We anticipate that Arboretum will spur the development of AI models that can
enable a variety of digital tools, ranging from pest control strategies and crop
monitoring to worldwide biodiversity assessment and environmental
conservation. These advancements are critical for ensuring food security,
preserving ecosystems, and mitigating the impacts of climate change. Arboretum
is publicly available, easily accessible, and ready for immediate use.
Please see the project website for links to our data, models, and code.

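This paper introduces the Arboretum dataset, the largest publicly accessible dataset aimed at advancing AI for biodiversity applications. Curated from the iNaturalist community science platform and verified by domain experts, it contains 134.6 million images, exceeding existing datasets in scale by an order of magnitude. It provides image-language paired data for many species, including birds, spiders/ticks/mites, insects, plants, fungi/mushrooms, snails, and snakes/lizards, making it a valuable resource for multimodal vision-language AI models in biodiversity assessment and agricultural research. Each image comes with scientific names, taxonomic details, and common names, which enhances the robustness of AI model training. The value of Arboretum is demonstrated by releasing CLIP models trained on a subset of 40 million captioned images. Several new benchmarks for rigorous evaluation are introduced, with reported zero-shot accuracy and evaluations across life stages, rare species, confounding species, and different levels of the taxonomic hierarchy. Arboretum is expected to drive the development of AI models that enable a range of digital tools, including pest control strategies, crop monitoring, global biodiversity assessment, and environmental conservation. These advances are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use. See the project website for links to the data, models, and code.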

Arboretum: A Large Multimodal Dataset Providing AI Support for Biodiversity

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

We introduce a new challenge to test the STEM skills of neural models.
Real-world problems often require solutions that combine knowledge from
STEM (science, technology, engineering, and math). Unlike existing datasets,
our dataset requires the understanding of multimodal vision-language
information of STEM. It is one of the largest and most comprehensive
datasets for the challenge, with 448 skills and 1,073,146
questions spanning all STEM subjects. Compared to existing datasets that often
focus on examining expert-level ability, our dataset includes fundamental
skills and questions designed based on the K-12 curriculum. We also add
state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our
benchmark. Results show that the recent model advances only help master a very
limited number of lower grade-level skills (2.5% in the third grade) in our
dataset. In fact, these models are still well below (averaging 54.7%) the
performance of elementary students, not to mention near expert-level
performance. To understand and increase the performance on our dataset, we
train the models on a training split of our dataset. Even though we observe
improved performance, the model performance remains relatively low compared to
average elementary students. To solve STEM problems, we will need novel
algorithmic innovations from the community.

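We introduce a new challenge to test the STEM skills of neural models. Our dataset covers multimodal vision-language information of STEM and includes 448 skills and 1,073,146 questions. Compared with existing datasets, it covers fundamental skills and questions drawn from the K-12 curriculum, and we add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark. Results show that recent model advances help with only a small fraction of lower grade-level skills in our dataset (2.5% in the third grade). In fact, these models still perform far below the average level of elementary students (averaging only 54.7%), let alone near expert level. To improve model performance on our dataset, we train the models on its training split. Although we observe improved performance, it remains relatively low compared to average elementary students, so novel algorithmic innovations from the community are needed to solve STEM problems.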

Measuring the Vision-Language STEM Skills of Neural Models

Measuring Vision-Language STEM Skills of Neural Models

Large pre-trained models have proved to be remarkable zero- and
(prompt-based) few-shot learners in unimodal vision and language tasks. We
propose MAPL, a simple and parameter-efficient method that reuses frozen
pre-trained unimodal models and leverages their strong generalization
capabilities in multimodal vision-language (VL) settings. MAPL learns a
lightweight mapping between the representation spaces of unimodal models using
aligned image-text data, and can generalize to unseen VL tasks from just a few
in-context examples. The small number of trainable parameters makes MAPL
effective at low-data and in-domain learning. Moreover, MAPL's modularity
enables easy extension to other pre-trained models. Extensive experiments on
several visual question answering and image captioning benchmarks show that
MAPL achieves superior or competitive performance compared to similar methods
while training orders of magnitude fewer parameters. MAPL can be trained in
just a few hours using modest computational resources and public datasets. We
release our code and pre-trained model weights.
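
The idea of training only a small mapping between frozen models can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not MAPL itself: the linear mapper, the squared-error objective, and both "frozen" encoders are assumptions made for illustration, whereas the abstract only states that a lightweight mapping is learned from aligned image-text data.

```python
import random


def frozen_image_encoder(image):
    """Stand-in for a frozen vision model: fixed features, never updated."""
    return [sum(image) / len(image), max(image), min(image)]


def frozen_text_embedder(caption):
    """Stand-in for a frozen language model's embedding of the paired text."""
    return [len(caption.split()) / 10, len(caption) / 100]


class LinearMapper:
    """The only trainable part: maps image features into the text embedding space."""

    def __init__(self, d_in, d_out, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]

    def __call__(self, x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in self.w]

    def sgd_step(self, x, target, lr):
        """One gradient step on squared error; returns the loss before the update."""
        pred = self(x)
        loss = 0.0
        for i, row in enumerate(self.w):
            err = pred[i] - target[i]
            loss += err * err
            for j in range(len(row)):
                row[j] -= lr * 2 * err * x[j]
        return loss
```

Only the mapper's weights change during training (here 3 x 2 = 6 values), which illustrates why such a setup involves far fewer trainable parameters than fine-tuning either unimodal model.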

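MAPL is a parameter-efficient method that reuses pre-trained models and leverages their strong generalization capabilities in multimodal vision-language settings. It learns a lightweight mapping between the representation spaces of models of different modalities using aligned image-text data, reducing the amount of training needed for in-context learning while still achieving good performance.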