In recent years, vision language pre-training frameworks have made
significant progress in natural language processing and computer vision,
achieving remarkable performance improvements on various downstream tasks.
However, when extended to point cloud data, existing works mainly focus on
building task-specific models and fail to extract universal 3D vision-language
embeddings that generalize well. We carefully investigate three common tasks in
semantic 3D scene understanding, and derive key insights into the development
of a pre-training model. Motivated by these observations, we propose 3DVLP
(3D vision-language pre-training with object contrastive learning), a
vision-language pre-training framework that transfers flexibly to 3D
vision-language downstream tasks. 3DVLP takes visual grounding as the proxy
task and introduces an Object-level IoU-guided Detection (OID) loss to obtain
high-quality proposals in the scene. Moreover, we design an Object-level
Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive
learning (OSC) task to align the objects with descriptions and distinguish
different objects in the scene, respectively. Extensive experiments verify the
excellent performance of 3DVLP on three 3D vision-language tasks, reflecting
its superiority in semantic 3D scene understanding.
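
The abstract names an IoU-guided detection loss and an object-level cross-contrastive alignment task but gives no formulas. The sketch below is only an illustration of the general idea, not the authors' implementation: it computes the IoU of two axis-aligned 3D boxes and then an InfoNCE-style loss in which proposals that overlap the ground-truth box are treated as positives for the description embedding. The function names, the box layout (xmin, ymin, zmin, xmax, ymax, zmax), the IoU threshold, and the temperature are all assumptions made for this example.

```python
import math


def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def cross_contrastive_loss(proposal_feats, proposal_boxes, text_feat, gt_box,
                           iou_threshold=0.25, temperature=0.1):
    """InfoNCE-style loss for one description (hypothetical formulation).

    Proposals whose IoU with the ground-truth box reaches iou_threshold
    count as positives; all other proposals act as negatives.
    """
    logits = [cosine(f, text_feat) / temperature for f in proposal_feats]
    positives = [box_iou_3d(b, gt_box) >= iou_threshold for b in proposal_boxes]
    if not any(positives):
        return 0.0  # nothing to align with in this scene
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    pos_mass = sum(e for e, p in zip(exps, positives) if p)
    return -math.log(pos_mass / sum(exps))
```

The loss is small when the proposals that overlap the target box are also the ones most similar to the description, and large when a non-overlapping proposal matches the text better.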

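This paper proposes 3DVLP, a 3D vision-language pre-training framework that performs well on 3D vision-language downstream tasks. The framework takes the relationships among objects in a scene into account and introduces several tasks for object-level cross-alignment and discrimination, giving better generalization than task-specific methods.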

Vision-language pre-training based on object contrastive learning, applied to 3D scene understanding

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

Automatic Speech Recognition (ASR) is a technology that converts spoken words
into text, facilitating interaction between humans and machines. One of the
most common applications of ASR is Speech-To-Text (STT) technology, which
simplifies user workflows by transcribing spoken words into text. In the
medical field, STT has the potential to significantly reduce the workload of
clinicians who rely on typists to transcribe their voice recordings. However,
developing an STT model for the medical domain is challenging due to the lack
of sufficient speech and text datasets. To address this issue, we propose a
medical-domain text correction method that modifies the output text of a
general STT system using the Vision Language Pre-training (VLP) method. VLP
combines textual and visual information to correct text based on image
knowledge. Our extensive experiments demonstrate that the proposed method
offers quantitatively and clinically significant improvements in STT
performance in the medical field. We further show that multi-modal
understanding of image and text information outperforms single-modal
understanding using only text information.
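
The abstract reports quantitative improvements in STT performance without naming the metric. One widely used measure for transcription quality is word error rate (WER), the word-level edit distance between a reference transcript and the system output divided by the reference length. The sketch below assumes WER purely for illustration; the function name and the example sentences are made up and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a general STT output before and after correction.
reference = "patient has acute myocardial infarction"
raw_output = "patient has a cute myocardial infection"
corrected = "patient has acute myocardial infarction"
print(word_error_rate(reference, raw_output))
print(word_error_rate(reference, corrected))
```

Under this assumption, a correction step helps exactly when it lowers the WER of the raw STT output against the reference transcript.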

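This paper proposes a medical-domain text correction method based on Vision Language Pre-training to address the difficulty of developing a medical speech-to-text model when data is scarce, and shows that multi-modal understanding of image and text information outperforms single-modal understanding that uses text alone.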

Improving medical speech-to-text accuracy with a vision-language pre-training model

Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Vision language pre-training aims to learn alignments between vision and
language from a large amount of data. We proposed multi-grained vision language
pre-training, a unified approach that can learn vision language alignments at
multiple granularities. This paper advances the proposed method by unifying image
and video encoding in one model and scaling up the model with large-scale data.
We present X$^2$-VLM, a pre-trained VLM with a modular architecture for both
image-text tasks and video-text tasks. Experimental results show that X$^2$-VLM
performs best at both base and large scales on image-text and video-text
tasks, making a good trade-off between performance and model scale. Moreover,
we show that the modular design of X$^2$-VLM makes it highly transferable, so
that it can be used in any language or domain. For example, by simply
replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art
multilingual multi-modal pre-trained models without any multilingual
pre-training. The code and pre-trained models will be available at
this http URL.
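
The abstract attributes transferability to a modular design in which the text encoder can be swapped. The toy sketch below illustrates that design pattern only: it does not reproduce X$^2$-VLM's code, and it does not use XLM-R. The class `ModularVLM`, the two toy encoders, and the tiny vocabulary are all hypothetical stand-ins chosen so the example runs without external libraries.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence

Vector = List[float]

VOCAB = ["cat", "dog"]  # toy concept vocabulary (assumption)


def toy_vision_encoder(image: Sequence[float]) -> Vector:
    """Stand-in vision encoder: the 'image' is already a concept vector."""
    return list(image)


def english_text_encoder(text: str) -> Vector:
    """Stand-in text encoder that only knows English words."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def multilingual_text_encoder(text: str) -> Vector:
    """Stand-in multilingual encoder: maps Spanish words to the same concepts."""
    translate = {"gato": "cat", "perro": "dog"}
    words = [translate.get(w, w) for w in text.lower().split()]
    return [float(words.count(w)) for w in VOCAB]


@dataclass(frozen=True)
class ModularVLM:
    """Two independent modules joined by a shared embedding space."""
    vision_encoder: Callable[[Sequence[float]], Vector]
    text_encoder: Callable[[str], Vector]

    def score(self, image: Sequence[float], text: str) -> float:
        v = self.vision_encoder(image)
        t = self.text_encoder(text)
        if len(v) != len(t):
            raise ValueError("encoders must share one embedding size")
        return sum(a * b for a, b in zip(v, t))


model = ModularVLM(toy_vision_encoder, english_text_encoder)
# Swap only the text module; the vision module is reused untouched.
multilingual_model = replace(model, text_encoder=multilingual_text_encoder)
```

Because the two modules meet only through a shared embedding space, replacing one leaves the other unchanged, which is the property the abstract relies on when it swaps in a multilingual text encoder.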

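This paper proposes multi-grained vision language pre-training, a joint vision-language pre-training method that learns vision-language alignments at multiple granularities. It also presents X$^2$-VLM, a pre-trained model with a modular architecture that achieves the best performance on image-text and video-text tasks with a good trade-off between performance and model scale, and that is highly transferable, so it can be used in any language or domain.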