In this paper, we describe the first publicly available fine-grained product
recognition dataset based on leaflet images. Using advertisement leaflets,
collected over several years from different European retailers, we provide a
total of 41.6k manually annotated product images in 832 classes. Further, we
investigate three different approaches for this fine-grained product
classification task: Classification by Image, Classification by Text, and
Classification by Image and Text. The "Classification by Text" approach uses
the text extracted directly from the leaflet product images. We show that
combining image and text as input improves the classification of products that
are visually difficult to distinguish. The final model achieves an accuracy of
96.4% with a Top-3 score of
99.2%. We release our code at
this https URL
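
The abstract reports both a plain (Top-1) accuracy and a Top-3 score. As a minimal sketch of how these two metrics differ, the snippet below computes top-k accuracy from per-class scores. The function name `top_k_accuracy` and the toy scores and labels are illustrative assumptions, not taken from the paper's code or data.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data (made up for illustration): 4 samples, 4 classes.
toy_scores = [
    [0.70, 0.20, 0.05, 0.05],  # true class 0 is ranked 1st
    [0.30, 0.40, 0.20, 0.10],  # true class 0 is ranked 2nd
    [0.10, 0.20, 0.30, 0.40],  # true class 0 is ranked 4th
    [0.10, 0.10, 0.60, 0.20],  # true class 2 is ranked 1st
]
toy_labels = [0, 0, 0, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
print(f"Top-1: {top1:.2f}, Top-3: {top3:.2f}")
```

A Top-3 score is always at least as high as the Top-1 accuracy, since every Top-1 hit is also a Top-3 hit.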

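This paper uses leaflet images collected from the advertising material of different European retailers to build a fine-grained product recognition dataset of 41.6k manually annotated images in 832 product classes, and shows that using both image and text as input improves recognition performance in the classification task.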

Fine-Grained Product Classification on Leaflet Advertisements

This paper presents our work on the Situated Interactive MultiModal
Conversations 2.0 (SIMMC 2.0) challenge held at Dialog State Tracking
Challenge 10. SIMMC 2.0 includes four subtasks, and we introduce our multimodal
approaches for subtasks #1 and #2 and for the generation part of subtask #4.
The SIMMC 2.0 dataset is a multimodal dataset containing image and text
information, which makes it more challenging than purely text-based
conversation because the tasks can only be solved by understanding the
relationship between image and text. Because text-only models such as BERT or
GPT2 are therefore limited in what they can solve, we propose a multimodal
model combining image and text. We first pretrain the multimodal model to
understand the relationship between image and text, then finetune it for each
task. We achieve the third-best performance on subtasks #1 and #2 and are
runner-up in the generation part of subtask #4. The source code is available at
this https URL

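This paper describes our work and methods for the Situated Interactive MultiModal Conversations 2.0 challenge held at Dialog State Tracking Challenge 10. It proposes a multimodal model that combines image and text and applies it to the SIMMC 2.0 dataset. By pretraining the model, we achieved the third-best performance on subtasks #1 and #2 and were runner-up in the generation part of subtask #4.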

Multimodal Interactions Using Pretrained Unimodal Models for SIMMC 2.0

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising
capabilities in grounding natural language in image data, facilitating a broad
variety of cross-modal tasks. However, we note that there exists a significant
gap between the objective forms of model pre-training and fine-tuning,
resulting in a need for large amounts of labeled data to stimulate the visual
grounding capability of VL-PTMs for downstream tasks. To address the challenge,
we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt
Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual
grounding into a fill-in-the-blank problem with color-based co-referential
markers in image and text, maximally mitigating the gap. In this way, CPT
enables strong few-shot and even zero-shot visual grounding capabilities of
VL-PTMs. Comprehensive experimental results show that the prompt-tuned VL-PTMs
outperform their fine-tuned counterparts by a large margin (e.g., 17.3%
absolute accuracy improvement, and 73.8% relative standard deviation reduction
on average with one shot in RefCOCO evaluation). We make the data and code for
this paper publicly available at this https URL
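
The reformulation described above can be sketched roughly as follows: each candidate image region is tied to a distinct color, the query is placed in a fill-in-the-blank template, and the color word the model prefers for the blank selects the region. This is a minimal sketch under stated assumptions, not the authors' implementation. The palette, the template wording, the `[MASK]` token, and the `mask_scores` dictionary (which stands in for a real VL-PTM's scores over color words) are all hypothetical.

```python
from typing import Dict, List, Tuple

# (x1, y1, x2, y2) in pixels; coordinates here are illustrative only.
Box = Tuple[int, int, int, int]

# Hypothetical palette: each color word doubles as a candidate answer for the blank.
PALETTE: List[str] = ["red", "green", "blue", "yellow"]


def assign_colors(boxes: List[Box]) -> Dict[str, Box]:
    """Tie each candidate region to one color (the co-referential marker)."""
    if len(boxes) > len(PALETTE):
        raise ValueError("more candidate regions than available colors")
    return {color: box for color, box in zip(PALETTE, boxes)}


def build_prompt(query: str, mask_token: str = "[MASK]") -> str:
    """Fill-in-the-blank template; the exact wording is an assumption."""
    return f"{query} is in {mask_token} color"


def ground(query: str, boxes: List[Box], mask_scores: Dict[str, float]) -> Box:
    """Return the region whose color word scores highest at the blank.

    mask_scores is a stand-in for a model's output over color words;
    colors that were not assigned to any region are ignored.
    """
    color_to_box = assign_colors(boxes)
    best_color = max(
        color_to_box, key=lambda c: mask_scores.get(c, float("-inf"))
    )
    return color_to_box[best_color]


# Toy example with made-up regions and scores.
candidate_boxes = [(10, 20, 110, 220), (150, 30, 300, 200)]
prompt = build_prompt("the horse watched by the woman")
fake_scores = {"red": 0.1, "green": 0.8, "blue": 0.1}
selected = ground("the horse watched by the woman", candidate_boxes, fake_scores)
print(prompt)
print(selected)
```

In a real system the regions would also be overlaid with their colors in the image itself, so that the model can link the color word in the text to the marked region; the sketch omits that step and only shows the text side and the final selection.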

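This study proposes Cross-modal Prompt Tuning, a paradigm for tuning vision-language models that recasts visual grounding as a fill-in-the-blank problem over image and text, giving the models strong zero-shot and few-shot capabilities with only a small amount of labeled data.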