Deep learning for histopathology has been used successfully for disease classification, image segmentation and more. However, combining image and text modalities with current state-of-the-art methods remains difficult because of the high resolution of histopathology images. Automatic report generation for histopathology images is one such challenge. In this work, we show that an existing pre-trained Vision Transformer can be used in a two-step process: first to encode 4096x4096 patches of the Whole Slide Image (WSI), and then as the encoder, paired with an LSTM decoder, for report generation. This yields a fairly performant and portable report generation mechanism that takes the whole high-resolution image into account, rather than only individual patches. We also use representations from an existing powerful pre-trained hierarchical vision transformer and show that they are useful not only for zero-shot classification but also for report generation.
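
As a rough illustration of the first step, the sketch below enumerates the 4096x4096 regions of a slide that would each be passed to the pre-trained encoder. Only the region size comes from the text above. The function name `region_grid`, the non-overlapping layout, and the choice to skip partial regions at the right and bottom edges are assumptions made for illustration, not details of our pipeline.

```python
from typing import Iterator, Tuple

REGION = 4096  # region side length in pixels, as stated in the abstract


def region_grid(width: int, height: int, region: int = REGION) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) top-left corner of each full region, in row-major order.

    Assumptions (not from the paper): regions do not overlap, and partial
    regions at the right and bottom edges are skipped.
    """
    for y in range(0, height - region + 1, region):
        for x in range(0, width - region + 1, region):
            yield (x, y)


# Hypothetical slide dimensions, chosen only to show the tiling.
corners = list(region_grid(10000, 9000))
print(corners)
```

Each corner identifies one region to encode; the resulting sequence of region embeddings is what an encoder-decoder model would then consume to generate the report.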


Automatic Report Generation for Histopathology Images Using a Pre-trained Vision Transformer