Vision-language pretraining has been shown to produce high-quality visual
encoders which transfer efficiently to downstream computer vision tasks. While
generative language models have gained widespread attention, image captioning
has thus far been mostly overlooked as a form of cross-modal pretraining in
favor of contrastive learning, especially in medical image analysis. In this
paper, we experiment with bidirectional captioning of radiology reports as a
form of pretraining and compare the quality and utility of learned embeddings
with those from contrastive pretraining methods. We optimize RadTex, a
CNN-encoder, transformer-decoder architecture, for the radiology domain. Results
show that not only does captioning pretraining yield visual encoders that are
competitive with contrastive pretraining (CheXpert competition multi-label AUC
of 89.4%), but also that our transformer decoder is capable of generating
clinically relevant reports (captioning macro-F1 score of 0.349 using the
CheXpert labeler) and responding to prompts with targeted, interactive outputs.

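This study uses bidirectional captioning of radiology reports as a form of pretraining and compares it with contrastive pretraining methods, showing that captioning pretraining not only yields competitive visual encoders but can also generate clinically relevant reports and targeted, interactive outputs.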

Bidirectional Captioning for Clinically Accurate and Interpretable Models

In this work, we construct the largest dataset for multimodal pretraining in
Chinese, which consists of over 1.9TB of images and 292GB of text covering a
wide range of domains. We propose a cross-modal pretraining method called M6,
short for Multi-Modality to Multi-Modality Multitask Mega-transformer, for
unified pretraining on both single-modality and multimodal data. We
scale the model size up to 10 billion and 100 billion parameters, and build the
largest pretrained model in Chinese. We apply the model to a series of
downstream applications, and demonstrate its outstanding performance in
comparison with strong baselines. Furthermore, we specifically design a
downstream task of text-guided image generation, and show that the finetuned M6
can create high-quality images with high resolution and abundant details.

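This work constructs the largest Chinese multimodal pretraining dataset, proposes a cross-modal pretraining method called M6, and demonstrates its strong performance across many downstream applications as well as its ability to generate high-quality images.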