Evaluating the compatibility between textual descriptions and corresponding
images represents a core endeavor within multi-modal research. In recent years,
a proliferation of reference-free methods, leveraging visual-language
pre-trained models (VLMs), has emerged. Empirical evidence has substantiated
that these innovative approaches exhibit a higher correlation with human
judgment, marking a significant advancement in the field. However, does a
higher correlation with human evaluations alone sufficiently establish the
completeness of a metric? To answer this question, we study whether
reference-free metrics have deficiencies. Specifically, inspired by
the Cobra Effect, we utilize metric scores as rewards to direct the captioning
model toward generating descriptions that closely align with the metric's
criteria. If a certain metric has flaws, it will be exploited by the model and
reflected in the generated sentences. Our findings reveal that descriptions
guided by these metrics contain significant flaws, e.g., incoherent statements
and excessive repetition. Subsequently, we propose a novel method termed
Self-Improving to rectify the identified shortcomings within these metrics. We
employ GPT-4V as an evaluative tool to assess generated sentences, and the
results show that our approach achieves state-of-the-art (SOTA) performance.
We also introduce a challenging evaluation benchmark called Flaws
Caption to evaluate reference-free image captioning metrics comprehensively.
Our code is available at
this https URL
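
The exploitation idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_metric` is a deliberately flawed keyword-counting score standing in for a VLM-based metric, and `greedy_exploit` is a simple greedy search standing in for the reward-driven training of a captioning model. Neither is the paper's actual metric or procedure.

```python
# Toy illustration only: `toy_metric` and `greedy_exploit` are hypothetical
# stand-ins, not the paper's metric or its reward-driven training procedure.

def toy_metric(caption_words, image_tags):
    """Flawed reference-free score: counts caption words that name image content.

    It ignores fluency and repetition, which is the loophole exploited below.
    """
    return sum(1 for word in caption_words if word in image_tags)


def greedy_exploit(vocab, image_tags, length):
    """Build a caption word by word, always picking the score-maximizing word."""
    caption = []
    for _ in range(length):
        best = max(vocab, key=lambda w: toy_metric(caption + [w], image_tags))
        caption.append(best)
    return caption


IMAGE_TAGS = {"dog", "frisbee", "grass"}
VOCAB = ["a", "dog", "catches", "frisbee", "on", "the", "grass"]

fluent = "a dog catches a frisbee on the grass".split()
exploited = greedy_exploit(VOCAB, IMAGE_TAGS, len(fluent))

print("fluent   :", " ".join(fluent), "->", toy_metric(fluent, IMAGE_TAGS))
print("exploited:", " ".join(exploited), "->", toy_metric(exploited, IMAGE_TAGS))
```

In this toy setting the optimized caption scores higher than the fluent one while repeating a single word, which mirrors the kind of flaw (excessive repetition) that the abstract reports for metric-guided descriptions.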

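Evaluating the compatibility between textual descriptions and their
corresponding images is a core task in multi-modal research. This paper studies
the deficiencies of reference-free metrics and proposes a new method called
Self-Improving to correct them. Generated sentences are evaluated with GPT-4V,
and the method achieves state-of-the-art performance. The paper also introduces
a challenging benchmark for comprehensively evaluating reference-free image
captioning metrics.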

Cobra Effect in Reference-Free Image Captioning Metrics

Large-scale visual-language pre-trained models (VLPMs) have proven their
excellent performance in downstream object detection for natural scenes.
However, zero-shot nuclei detection on H&E images via VLPMs remains
underexplored. The large gap between medical images and the web-originated
text-image pairs used for pre-training makes it a challenging task. In this
paper, we attempt to explore the potential of the object-level VLPM, Grounded
Language-Image Pre-training (GLIP) model, for zero-shot nuclei detection.
Concretely, an automatic prompt design pipeline is devised based on the
association binding trait of VLPMs and the image-to-text VLPM BLIP, avoiding
empirical manual prompt engineering. We further establish a self-training
framework that uses the automatically designed prompts to generate preliminary
results from GLIP as pseudo labels and refines the predicted boxes
iteratively. Our method achieves remarkable performance for label-free
nuclei detection, surpassing other comparison methods. Most importantly, our
work demonstrates that VLPMs pre-trained on natural image-text pairs exhibit
astonishing potential for downstream tasks in the medical field as well. Code
will be released at this https URL
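
The iterative pseudo-label loop described above can be sketched roughly as follows. This is a generic self-training skeleton under stated assumptions, not the authors' implementation: `self_train`, `toy_predict`, `toy_fit`, the confidence threshold `min_score`, and the scalar "model" are all hypothetical names and simplifications, and no GLIP or BLIP call is shown.

```python
# Generic self-training skeleton (hypothetical names, not the paper's code).

def self_train(predict, fit, images, model, rounds=3, min_score=0.65):
    """Alternate between pseudo-labelling with the current model and refitting."""
    for _ in range(rounds):
        pseudo_labels = []
        for image in images:
            # Keep only confident detections as pseudo labels for this image.
            kept = [d for d in predict(model, image) if d["score"] >= min_score]
            pseudo_labels.append((image, kept))
        model = fit(pseudo_labels)
    return model


# Toy stand-ins: an "image" is a list of candidate-region intensities and the
# "model" is a single number, the intensity the detector considers nucleus-like.
def toy_predict(center, image):
    return [{"value": v, "score": 1.0 - abs(v - center)} for v in image]


def toy_fit(pseudo_labels):
    values = [d["value"] for _, kept in pseudo_labels for d in kept]
    if not values:
        raise ValueError("no pseudo labels survived the confidence threshold")
    return sum(values) / len(values)


images = [[0.8, 0.9, 1.0, 0.2, 0.1]]  # three bright candidates, two dark ones
refined = self_train(toy_predict, toy_fit, images, model=0.6)
print(refined)
```

Starting from a rough initial guess (0.6, playing the role of the prompt-based zero-shot result), each round keeps the confident candidates and refits on them, so the estimate drifts toward the bright cluster. A real pipeline would replace the toy functions with a detector's inference and fine-tuning steps.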

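This paper explores how VLPMs pre-trained on large-scale natural image-text
pairs can be used for zero-shot nuclei detection in medical images, and
proposes a framework built on an automatic prompt design pipeline. Through
self-training, the method achieves strong label-free nuclei detection
performance and demonstrates the great potential of VLPMs in the medical field.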

Zero-shot Nuclei Detection via Visual-Language Pre-trained Models

Open-vocabulary detection (OVD) is an object detection task aiming at
detecting objects from novel categories beyond the base categories on which the
detector is trained. Recent OVD methods rely on large-scale visual-language
pre-trained models, such as CLIP, for recognizing novel objects. We identify
two core obstacles that need to be tackled when incorporating these models
into detector training: (1) the distribution mismatch that happens when
applying a VL-model trained on whole images to region recognition tasks; (2)
the difficulty of localizing objects of unseen classes. To overcome these
obstacles, we propose CORA, a DETR-style framework that adapts CLIP for
Open-vocabulary detection by Region prompting and Anchor pre-matching. Region
prompting mitigates the whole-to-region distribution gap by prompting the
region features of the CLIP-based region classifier. Anchor pre-matching helps
learn generalizable object localization through a class-aware matching mechanism.
We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel
classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting
to extra training data. When extra training data is available, we train
CORA$^+$ on both ground-truth base-category annotations and additional pseudo
bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO
OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
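
For readers unfamiliar with the AP50 figures quoted above, the sketch below shows the matching criterion behind them: a detection counts as correct when its class matches and its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The function names and the `(x1, y1, x2, y2)` box layout are assumptions for illustration, and the full metric additionally ranks detections by confidence and integrates a precision-recall curve, which is not shown here.

```python
# Illustrative helpers (hypothetical names); boxes are (x1, y1, x2, y2).

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, pred_label, gt_box, gt_label):
    """True when the labels agree and the boxes overlap with IoU >= 0.5."""
    return pred_label == gt_label and box_iou(pred_box, gt_box) >= 0.5


gt = (0, 0, 10, 10)
print(is_match_at_50((0, 0, 10, 8), "cat", gt, "cat"))   # large overlap
print(is_match_at_50((5, 0, 15, 10), "cat", gt, "cat"))  # shifted by half a box
```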

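Region prompting and Anchor pre-matching are used to adapt CLIP to the
open-vocabulary detection task. The approach is applied successfully to object
detection and surpasses the previous best performance in evaluation.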