Radiotherapists require accurate registration of MR/CT images to effectively
use information from both modalities. In a typical registration pipeline, rigid
or affine transformations are applied to roughly align the fixed and moving
images before proceeding with the deformation step. While recent learning-based
methods have shown promising results in the rigid/affine step, these methods
often require images with similar field-of-view (FOV) for successful alignment.
As a result, aligning images with different FOVs remains a challenging task.
Self-supervised landmark detection methods like self-supervised Anatomical
eMbedding (SAM) have emerged as a useful tool for mapping and cropping images
to similar FOVs. However, these methods are currently limited to intra-modality
use only. To address this limitation and enable cross-modality matching, we
propose a new approach called Cross-SAM. Our approach utilizes a novel
iterative process that alternates between embedding learning and CT-MRI
registration. We start by applying aggressive contrast augmentation on both CT
and MRI images to train a SAM model. We then use this SAM to identify
corresponding regions on paired images using robust grid-point matching,
followed by a point-set-based affine/rigid registration and a deformable
fine-tuning step to produce registered paired images. We use these registered
pairs to enhance the matching ability of SAM, and repeat this process
iteratively. We use the final model for cross-modality matching tasks. We
evaluated our approach on two CT-MRI affine registration datasets and found
that Cross-SAM achieved robust affine registration on both datasets,
significantly outperforming other methods and achieving state-of-the-art
performance.
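
The point-set-based affine step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the matched grid points are already available as two arrays of corresponding coordinates, and it uses an ordinary least-squares fit. The function name `fit_affine` is hypothetical, and the robust filtering of bad matches and the rigid variant are omitted.

```python
import numpy as np

def fit_affine(moving_pts, fixed_pts):
    """Least-squares affine map (A, t) such that fixed ~= moving @ A.T + t.

    Assumes row i of moving_pts corresponds to row i of fixed_pts.
    """
    moving_pts = np.asarray(moving_pts, dtype=float)
    fixed_pts = np.asarray(fixed_pts, dtype=float)
    n = moving_pts.shape[0]
    # Homogeneous coordinates: append a column of ones for the translation.
    X = np.hstack([moving_pts, np.ones((n, 1))])
    # Solve X @ P ~= fixed_pts for the (d + 1) x d parameter matrix P.
    P, *_ = np.linalg.lstsq(X, fixed_pts, rcond=None)
    A = P[:-1].T  # linear part
    t = P[-1]     # translation
    return A, t
```

Given matched points from a moving and a fixed image, `A, t = fit_affine(moving_pts, fixed_pts)` returns the transform that maps moving coordinates into the fixed image's frame. At least four non-coplanar 3D correspondences are needed for the fit to be well determined.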

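To address the alignment of images with different FOVs, this paper proposes Cross-SAM, a new method that alternates between embedding learning and CT-MRI registration in an iterative process to enable cross-modality matching. It achieves robust results on CT-MRI affine registration datasets, significantly outperforming other methods and reaching state-of-the-art performance.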

Matching in the Wild: Learning Anatomical Embeddings for Multi-Modality Images

In most Vision-Language (VL) models, the understanding of the image structure
is enabled by injecting the position information (PI) about objects in the
image. In our case study of LXMERT, a state-of-the-art VL model, we probe the
use of the PI in the representation and study its effect on Visual Question
Answering. We show that the model is not capable of leveraging the PI for the
image-text matching task on a challenge set where only position differs. Yet,
our experiments with probing confirm that the PI is indeed present in the
representation. We introduce two strategies to tackle this: (i) Positional
Information Pre-training and (ii) Contrastive Learning on PI using
Cross-Modality Matching. With these strategies, the model can correctly classify
whether images and detailed PI statements match. In addition to the 2D
information from bounding boxes, we introduce the object's depth as a new
feature for better object localization in space. Even though we were able to
improve the model properties as defined by our probes, this has only a
negligible effect on downstream performance. Our results thus highlight an important issue of
multimodal modeling: the mere presence of information detectable by a probing
classifier is not a guarantee that the information is available in a
cross-modal setup.
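
As a rough illustration of the position information discussed above, the sketch below builds a per-object position vector from a bounding box and an optional depth value. The function name `position_features`, the normalization by image size, and the `max_depth` scaling are assumptions made for illustration; the abstract does not specify how the depth feature is encoded.

```python
def position_features(box, img_w, img_h, depth=None, max_depth=1.0):
    """Return normalized [x1, y1, x2, y2] plus optional normalized depth.

    box is (x1, y1, x2, y2) in pixels. The depth normalization is a
    hypothetical choice, not taken from the paper.
    """
    x1, y1, x2, y2 = box
    # Scale pixel coordinates to [0, 1] so they are resolution-independent.
    feats = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    if depth is not None:
        # Append depth as a fifth positional feature.
        feats.append(depth / max_depth)
    return feats
```

Calling `position_features((10, 20, 110, 220), 200, 400, depth=2.5, max_depth=10.0)` yields a five-element vector: four normalized box coordinates followed by the normalized depth.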

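This work studies how position information in vision-language models affects the image-text matching task and proposes two strategies: positional information pre-training and contrastive learning based on cross-modality matching. The results show that, even though position information is present in the representation, the original model cannot use it to correctly classify images against detailed position statements; the two strategies fix this but have only a negligible effect on downstream performance.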

Probing the Role of Positional Information in Vision-Language Models

Visible-infrared person re-identification (VI-ReID) is a challenging and
essential task, which aims to retrieve a set of person images over visible and
infrared camera views. In order to mitigate the impact of large modality
discrepancy existing in heterogeneous images, previous methods attempt to apply
generative adversarial networks (GANs) to generate modality-consistent data.
However, due to severe color variations between the visible domain and infrared
domain, the generated fake cross-modality samples often fail to possess good
qualities to fill the modality gap between synthesized scenarios and target
real ones, which leads to sub-optimal feature representations. In this work, we
address the cross-modality matching problem with Aligned Grayscale Modality (AGM),
a unified dark-line spectrum that reformulates visible-infrared dual-mode
learning as a gray-gray single-mode learning problem. Specifically, we generate
the grayscale modality from the homogeneous visible images. Then, we train a
style transfer model to transfer infrared images into homogeneous grayscale
images. In this way, the modality discrepancy is significantly reduced in the
image space. In order to reduce the remaining appearance discrepancy, we
further introduce a multi-granularity feature extraction network to conduct
feature-level alignment. Rather than relying on global information alone, we
propose to exploit local (head-shoulder) features to assist person Re-ID; the
global and local features complement each other to form a stronger feature
descriptor. Comprehensive experiments on the mainstream evaluation datasets,
including SYSU-MM01 and RegDB, indicate that our method significantly boosts
cross-modality retrieval performance over state-of-the-art methods.
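
The first AGM step, generating a grayscale modality from visible images, can be sketched as follows. The abstract does not say which conversion is used, so this example assumes the common ITU-R BT.601 luma weights; the function name `to_grayscale` and the replication to three channels (so the output could feed a standard three-channel backbone) are likewise assumptions.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB image to a 3-channel grayscale image.

    Uses BT.601 luma weights, which is an assumed choice here.
    """
    rgb = np.asarray(rgb, dtype=np.float32)
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Weighted sum over the channel axis gives an (H, W) luminance map.
    gray = rgb @ weights
    # Replicate the single channel three times to keep an (H, W, 3) shape.
    return np.repeat(gray[..., None], 3, axis=-1)
```

The infrared branch in the abstract relies on a trained style transfer model, which is beyond the scope of this sketch.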

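This paper proposes a cross-modality person re-identification method based on Aligned Grayscale Modality (AGM). It generates grayscale images from visible images and converts infrared images to grayscale via style transfer to obtain modality-consistent data, then aligns them at the feature level with a multi-granularity feature extraction network, which significantly improves cross-modality retrieval performance.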