Recent years have seen a surge of interest in anomaly detection for tackling
industrial defect detection, event detection, etc. However, existing
unsupervised anomaly detectors, particularly those for the vision modality,
face significant challenges due to redundant information and sparse latent
space. Conversely, the language modality performs well because its data is
relatively simple. This paper tackles these challenges for the vision
modality from a multimodal point of view. Specifically, we propose Cross-modal
Guidance (CMG), which consists of Cross-modal Entropy Reduction (CMER) and
Cross-modal Linear Embedding (CMLE), to tackle the redundant information issue
and sparse space issue, respectively. CMER masks parts of the raw image and
computes the matching score with the text. Then, CMER discards irrelevant
pixels to make the detector focus on critical contents. To learn a more compact
latent space for the vision anomaly detector, CMLE learns a correlation
structure matrix from the language modality, and the latent space of the
vision modality is then learned under the guidance of this matrix. As a result,
semantically similar images lie closer together in the vision latent space. Extensive
experiments demonstrate the effectiveness of the proposed methods.
Particularly, CMG outperforms the baseline that only uses images by 16.81%.
Ablation experiments further confirm the synergy among the proposed methods, as
each component depends on the other to achieve optimal performance.
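
The two components described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a hypothetical `encode` function that maps an image to a vector comparable with a text vector, a grayscale image whose sides are divisible by the patch size, and a locally-linear-embedding style ridge fit for the correlation structure matrix. All function names and parameters (`patch_relevance`, `discard_irrelevant`, `correlation_structure`, `cmle_loss`, `keep_ratio`, `ridge`) are illustrative assumptions.

```python
import numpy as np

def patch_relevance(image, text_vec, encode, patch=4):
    """CMER-style scoring: how much does masking each patch lower the match score?"""
    h, w = image.shape
    base = float(encode(image) @ text_vec)
    drops = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            masked = image.copy()
            masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
            drops[i, j] = base - float(encode(masked) @ text_vec)
    return drops

def discard_irrelevant(image, drops, keep_ratio=0.5, patch=4):
    """Zero out patches whose masking changes the match score the least."""
    k = max(1, int(round(keep_ratio * drops.size)))
    threshold = np.sort(drops.ravel())[-k]
    keep = (drops >= threshold).astype(float)
    pixel_mask = np.kron(keep, np.ones((patch, patch)))  # patch mask -> pixel mask
    return image * pixel_mask

def correlation_structure(text_emb, ridge=1e-3):
    """CMLE-style matrix W (zero diagonal) with text_emb ~= W @ text_emb.

    Assumption: each text embedding is reconstructed from the others by ridge regression.
    """
    n = text_emb.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        A = text_emb[others]                      # (n-1, d)
        gram = A @ A.T + ridge * np.eye(n - 1)
        W[i, others] = np.linalg.solve(gram, A @ text_emb[i])
    return W

def cmle_loss(vision_latent, W):
    """Penalty that is small when vision latents share the text correlation structure."""
    return float(np.mean((vision_latent - W @ vision_latent) ** 2))
```

In a training loop, `cmle_loss` would be added to the detector's own objective so that the vision latent space inherits the neighborhood structure of the text embeddings; how the paper weights or combines these terms is not stated in the abstract.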

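This paper proposes Cross-modal Guidance (CMG), which uses Cross-modal Entropy Reduction (CMER) and Cross-modal Linear Embedding (CMLE) to address redundant information and the sparse latent space in the vision modality. Experiments show that the method outperforms the image-only baseline by 16.81%.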

Improving Vision Anomaly Detection with the Guidance of Language Modality

Contrastive language-image pre-training (CLIP) has demonstrated remarkable
zero-shot classification ability, namely image classification using novel text
labels. Existing works have attempted to enhance CLIP by fine-tuning on
downstream tasks, but these have inadvertently led to performance degradation
on unseen classes, thus harming zero-shot generalization. This paper aims to
address this challenge by leveraging readily available image-text pairs from an
external dataset for cross-modal guidance during inference. To this end, we
propose X-MoRe, a novel inference method comprising two key steps: (1)
cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query
image, we harness the power of CLIP's cross-modal representations to retrieve
relevant textual information from an external image-text pair dataset. Then, we
assign higher weights to the more reliable modality between the original query
image and retrieved text, contributing to the final prediction. X-MoRe
demonstrates robust performance across a diverse set of tasks without the need
for additional training, showcasing the effectiveness of utilizing cross-modal
features to maximize CLIP's zero-shot ability.
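
The two inference steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. It assumes precomputed CLIP-like embeddings, retrieves captions by direct image-to-text cosine similarity, and uses each modality's maximum class probability as its confidence. The function name `xmore_predict` and the parameters `k` and `temp` are hypothetical.

```python
import numpy as np

def l2norm(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def xmore_predict(query_img, bank_txt, class_txt, k=2, temp=0.01):
    """Return ensembled class probabilities and the two modality weights."""
    q = l2norm(query_img)
    captions = l2norm(bank_txt)
    classes = l2norm(class_txt)

    # Step 1: cross-modal retrieval of the k captions closest to the query image.
    top = np.argsort(-(captions @ q))[:k]
    retrieved = l2norm(captions[top].mean(axis=0))

    # Class probabilities from each modality on its own.
    p_img = softmax(classes @ q / temp)
    p_txt = softmax(classes @ retrieved / temp)

    # Step 2: modal-confidence-based ensemble.
    # Assumption: confidence is the maximum class probability of each modality.
    conf = np.array([p_img.max(), p_txt.max()])
    weights = conf / conf.sum()
    return weights[0] * p_img + weights[1] * p_txt, weights
```

No parameters are trained here, which matches the claim that the method works at inference time only; the confidence measure and the retrieval rule are the parts most likely to differ from the paper.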

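Through cross-modal guidance and a modal-confidence-based ensemble, X-MoRe uses CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset and gives the more reliable modality a larger weight in the final prediction. It thereby shows robust performance across diverse tasks and makes full use of CLIP's zero-shot classification ability.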

Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval

Diffusion generative models have recently greatly improved the power of
text-conditioned image generation. Existing image generation models mainly
include text conditional diffusion models and cross-modal guided diffusion
models, which are good at simple scene image generation and complex scene image
generation, respectively. In this work, we propose a simple yet effective
approach, namely UPainting, to unify simple and complex scene image generation,
as shown in Figure 1. Based on architecture improvements and diverse guidance
schedules, UPainting effectively integrates cross-modal guidance from a
pretrained image-text matching model into a text conditional diffusion model
that utilizes a pretrained Transformer language model as the text encoder. Our
key finding is that combining the power of a large-scale Transformer language
model in understanding language with that of an image-text matching model in
capturing cross-modal semantics and style is effective for improving sample
fidelity and image-text alignment in image generation. In this way, UPainting has a more
general image generation capability, which can generate images of both simple
and complex scenes more effectively. To comprehensively compare text-to-image
models, we further create a more general benchmark, UniBench, with well-written
Chinese and English prompts in both simple and complex scenes. We compare
UPainting with recent models and find that UPainting greatly outperforms other
models in terms of caption similarity and image fidelity in both simple and
complex scenes. UPainting project page https://upainting.github.io/.
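
One common way to add guidance from an image-text matching model to a diffusion sampler is to nudge each denoised sample along the gradient of the matching score. The sketch below shows that idea on a toy problem and may differ from what UPainting actually does. It assumes a linear stand-in `W` for the image encoder, a text embedding `t`, and a hypothetical `denoise` callable standing in for the text conditional diffusion model's update.

```python
import numpy as np

def cosine(u, t):
    """Cosine similarity between two vectors."""
    return float(u @ t / (np.linalg.norm(u) * np.linalg.norm(t)))

def match_grad(x, W, t):
    """Gradient of cosine(W @ x, t) with respect to x (toy linear image encoder W)."""
    u = W @ x
    nu, nt = np.linalg.norm(u), np.linalg.norm(t)
    grad_u = t / (nu * nt) - (u @ t) * u / (nu ** 3 * nt)
    return W.T @ grad_u

def guided_step(x, denoise, W, t, scale=0.1):
    """One sampler update: the model's proposal plus a matching-score nudge."""
    x_next = denoise(x)
    return x_next + scale * match_grad(x_next, W, t)
```

The `scale` argument plays the role of a guidance strength; a schedule that varies it over sampling steps would be one reading of the "diverse guidance schedules" mentioned above, though the abstract does not specify them.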

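This paper introduces UPainting, a model suited to generating images of both simple and complex scenes. It uses a pretrained Transformer language model as the text encoder and combines it with cross-modal guidance from a pretrained image-text matching model, which improves the sample fidelity and image-text alignment of the generated images. In comparative experiments on simple and complex scenes with Chinese and English prompts, UPainting performs better than other models.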