We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel
pre-training paradigm for Vision-Language Models using data from large-scale
web screenshot rendering. Using web screenshots unlocks a treasure trove of
visual and textual cues that are not present in image-text pairs. In S4,
we leverage the inherent tree-structured hierarchy of HTML elements and the
spatial localization to carefully design 10 pre-training tasks with large-scale
annotated data. These tasks resemble downstream tasks across different domains
and the annotations are cheap to obtain. We demonstrate that, compared to
current screenshot pre-training objectives, our innovative pre-training method
significantly enhances the performance of an image-to-text model on nine varied
and popular downstream tasks, with improvements of up to 76.1% on Table
Detection and at least 1% on Widget Captioning.
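
To make the idea of cheap supervision from the HTML tree concrete, here is a minimal sketch, not the authors' pipeline. It assumes only Python's standard-library `html.parser`. The names `TreeAnnotator` and `annotate` are hypothetical, and the spatial localization (bounding boxes) is left out because it would require rendering the page in a browser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TreeAnnotator(HTMLParser):
    """Hypothetical helper: collect (tag path, text) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, root first
        self.pairs = []  # (path, text) annotations in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def annotate(html):
    """Return the (tag path, text) pairs found in an HTML string."""
    parser = TreeAnnotator()
    parser.feed(html)
    parser.close()
    return parser.pairs


page = "<html><body><h1>Title</h1><table><tr><td>A</td><td>B</td></tr></table></body></html>"
pairs = annotate(page)
```

Each pair ties a piece of visible text to its position in the element hierarchy, which is the kind of label that comes for free once a page's markup is available alongside its screenshot.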

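We propose a new pre-training paradigm, Strongly Supervised pre-training with ScreenShots (S4), which pre-trains vision-language models on data from large-scale web screenshot rendering. Web screenshots provide rich visual and textual cues that are absent from image-text pairs. S4 uses the tree-structured hierarchy of HTML elements and their spatial localization to design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks in different domains, and the annotations are cheap to obtain. Experiments show that, compared with current screenshot pre-training objectives, the method significantly improves an image-to-text model on nine varied and popular downstream tasks, by up to 76.1% on Table Detection and by at least 1% on Widget Captioning.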

Enhancing Vision-Language Pre-training with Rich Supervisions

This paper investigates video game identification through single screenshots,
utilizing five convolutional neural network (CNN) architectures (MobileNet,
DenseNet, EfficientNetB0, EfficientNetB2, and EfficientNetB3) across 22 home
console systems, spanning from Atari 2600 to PlayStation 5. The results confirm
the hypothesis that CNNs autonomously extract image features, enabling the
identification of game titles from screenshots without additional features.
Using ImageNet pre-trained weights, EfficientNetB3 achieves the highest average
accuracy (74.51%), while DenseNet169 excels in 14 of the 22 systems. Employing
alternative initial weights from another screenshots dataset boosts accuracy
for EfficientNetB2 and EfficientNetB3, with the latter reaching a peak accuracy
of 76.36% and converging in fewer epochs, 20.5 on average versus 23.7.
Overall, the combination of optimal architecture and weights attains
77.67% accuracy, primarily led by EfficientNetB3 in 19 systems. These findings
underscore the efficacy of CNNs in video game identification through
screenshots.
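
One plausible reading of the "combination of optimal architecture and weights" is that the best-scoring configuration is picked for each console system and the resulting accuracies are averaged. The sketch below illustrates that reading only. The function name `best_per_system`, the system and configuration labels, and all numbers are invented placeholders, not results from the paper.

```python
def best_per_system(results):
    """Pick the top configuration per system and average those accuracies.

    results maps system name -> {configuration name: accuracy}.
    Returns ({system: (configuration, accuracy)}, mean accuracy).
    """
    choices = {
        system: max(by_config.items(), key=lambda kv: kv[1])
        for system, by_config in results.items()
    }
    mean_acc = sum(acc for _, acc in choices.values()) / len(choices)
    return choices, mean_acc


# Placeholder values for illustration only (NOT taken from the paper).
toy_results = {
    "system_a": {"arch_x+weights_1": 0.70, "arch_x+weights_2": 0.75},
    "system_b": {"arch_x+weights_1": 0.80, "arch_y+weights_1": 0.85},
}
choices, mean_acc = best_per_system(toy_results)
```

Under this reading, the combined figure can never be lower than the average of any single configuration, since each system contributes its own maximum.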

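This study examines video game identification from single screenshots using five convolutional neural networks (MobileNet, DenseNet, EfficientNetB0, EfficientNetB2, and EfficientNetB3) across 22 home console systems, from Atari 2600 to PlayStation 5. It confirms that CNNs can extract image features from screenshots, which makes it possible to identify game titles without additional features. With ImageNet pre-trained weights, EfficientNetB3 achieves the highest average accuracy (74.51%), while DenseNet169 performs best on 14 of the 22 systems. Using alternative initial weights from another screenshot dataset improves the accuracy of EfficientNetB2 and EfficientNetB3; the latter reaches a peak accuracy of 76.36% while its average number of epochs to convergence drops from 23.7 to 20.5. Overall, the combination of optimal architecture and weights, led by EfficientNetB3 in 19 systems, attains 77.67% accuracy. These findings underscore the effectiveness of CNNs for identifying video games from screenshots.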