In this work, we show that text-to-image generative models can be 'inverted'
to assess their own text-image understanding capabilities in a completely
automated manner.
Our method, called SelfEval, uses the generative model to compute the
likelihood of real images given text prompts, making the generative model
directly applicable to discriminative tasks.
Using SelfEval, we repurpose standard datasets created for evaluating
multimodal text-image discriminative models to evaluate generative models in a
fine-grained manner: assessing their performance on attribute binding, color
recognition, counting, shape recognition, and spatial understanding.
To the best of our knowledge, SelfEval is the first automated metric for
measuring text-faithfulness to show a high degree of agreement with gold-standard
human evaluations across multiple models and benchmarks.
Moreover, SelfEval enables us to evaluate generative models on challenging
tasks such as the Winoground image-score, where they demonstrate performance
competitive with discriminative models.
We also show severe drawbacks of standard automated metrics such as
CLIP-score for measuring text faithfulness on benchmarks such as DrawBench, and
how SelfEval sidesteps these issues.
We hope SelfEval enables easy and reliable automated evaluation for diffusion
models.
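
The idea of using likelihoods for a discriminative task can be sketched as follows. This is a minimal illustration, not the SelfEval implementation: the scorer interface, the function names, and the toy data are all assumptions made for the example. A real scorer would wrap a text-to-image generative model and return an estimate of the likelihood of an image given a prompt.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: returns an estimate of log p(image | prompt).
# In practice this would wrap a generative model; here it is a stand-in.
Scorer = Callable[[object, str], float]


def pick_prompt(image, prompts: Sequence[str], scorer: Scorer) -> int:
    """Return the index of the prompt under which the image scores highest."""
    scores = [scorer(image, p) for p in prompts]
    return max(range(len(prompts)), key=scores.__getitem__)


def accuracy(examples, scorer: Scorer) -> float:
    """examples: iterable of (image, candidate_prompts, index_of_correct_prompt)."""
    examples = list(examples)
    correct = sum(
        pick_prompt(image, prompts, scorer) == gold
        for image, prompts, gold in examples
    )
    return correct / len(examples)


def toy_scorer(image, prompt: str) -> float:
    """Toy stand-in: an "image" is a set of tags, and the score counts how many
    prompt words appear among them. Purely illustrative."""
    return float(sum(word in image for word in prompt.split()))


# Made-up examples in the spirit of attribute and counting tasks.
examples = [
    ({"red", "cube"}, ["a red cube", "a blue cube"], 0),
    ({"two", "dogs"}, ["one dog", "two dogs"], 1),
]
selfeval_accuracy = accuracy(examples, toy_scorer)
```

Swapping `toy_scorer` for a likelihood estimate from an actual generative model leaves the rest of the loop unchanged, which is the sense in which the generative model is applied to a discriminative task.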

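SelfEval, an automated method built on text-to-image generative models, can be used to evaluate the performance of generative models on multimodal text-image discriminative tasks, and it shows a high degree of agreement with human evaluations of text faithfulness.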

SelfEval: Leveraging the discriminative nature of generative models for evaluation

An automated metric to evaluate dialogue quality is vital for optimizing
data-driven dialogue management. The common approach of relying on explicit user
feedback during a conversation is intrusive and sparse. Current models to
estimate user satisfaction use limited feature sets and employ annotation
schemes with limited generalizability to conversations spanning multiple
domains. To address these gaps, we created a new Response Quality annotation
scheme, introduced five new domain-independent feature sets and experimented
with six machine learning models to estimate User Satisfaction at both turn and
dialogue level.
Response Quality ratings achieved significantly high correlation (0.76) with
explicit turn-level user ratings. Using the new feature sets we introduced,
a Gradient Boosting Regression model achieved the best (rating [1-5]) prediction
performance on 26 seen domains (linear correlation ~0.79) and one new multi-turn
domain (linear correlation 0.67). We observed a 16% relative improvement (68% -> 79%)
in binary ("satisfactory/dissatisfactory") class prediction accuracy of a
domain-independent dialogue-level satisfaction estimation model after including
predicted turn-level satisfaction ratings as features.
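
The two kinds of figures quoted above, a relative improvement and a linear correlation, can be reproduced with a short sketch. The function names are chosen for this example, and the rating lists are made-up placeholders rather than data from the study.

```python
import math


def relative_improvement(before: float, after: float) -> float:
    """Gain expressed as a fraction of the starting value."""
    return (after - before) / before


def pearson(xs, ys) -> float:
    """Pearson linear correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Accuracy moving from 68% to 79%, as quoted in the abstract.
gain = relative_improvement(0.68, 0.79)
print(f"relative improvement: {gain:.1%}")

# Placeholder ratings on a 1-5 scale, for illustration only.
predicted = [1.5, 2.5, 3.5, 4.5]
observed = [1, 2, 3, 4]
r = pearson(predicted, observed)
```

The 11-point absolute gain divided by the 68% baseline gives roughly 0.16, which is the 16% relative improvement reported.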

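This paper proposes a new automated metric based on a Response Quality annotation scheme. By introducing five new domain-independent feature sets, it builds machine learning models that estimate user satisfaction at both the turn and dialogue level and achieve high prediction performance.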

Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation

Automatically describing an image with a sentence is a long-standing
challenge in computer vision and natural language processing. Due to recent
progress in object detection, attribute classification, action recognition,
etc., there is renewed interest in this area. However, evaluating the quality
of descriptions has proven to be challenging. We propose a novel paradigm for
evaluating image descriptions that uses human consensus. This paradigm consists
of three main parts: a new triplet-based method of collecting human annotations
to measure consensus, a new automated metric (CIDEr) that captures consensus,
and two new datasets, PASCAL-50S and ABSTRACT-50S, that contain 50 sentences
describing each image. Our simple metric captures human judgment of consensus
better than existing metrics across sentences generated by various sources. We
also evaluate five state-of-the-art image description approaches using this new
protocol and provide a benchmark for future comparisons. A version of CIDEr
named CIDEr-D is available as part of the MS COCO evaluation server to enable
systematic evaluation and benchmarking.
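
A consensus-style score of this kind can be sketched as an average TF-IDF weighted n-gram cosine similarity between a candidate sentence and the reference sentences for an image. The sketch below is a simplified illustration under stated assumptions, not the official CIDEr or CIDEr-D code: it uses whitespace tokenization, n-grams up to length two (CIDEr as published uses up to four), no stemming, and made-up sentences. All function names are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf(counts, doc_freq, num_images):
    """Weight n-gram counts by term frequency times inverse document frequency."""
    total = sum(counts.values()) or 1
    return {
        gram: (count / total) * math.log(num_images / max(doc_freq.get(gram, 0), 1))
        for gram, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def consensus_score(candidate, refs_per_image, image_id, max_n=2):
    """Average similarity between the candidate and one image's references."""
    num_images = len(refs_per_image)
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: number of images whose references contain the n-gram.
        doc_freq = Counter()
        for refs in refs_per_image.values():
            seen = set()
            for ref in refs:
                seen.update(ngrams(ref.split(), n))
            doc_freq.update(seen)
        cand_vec = tfidf(ngrams(candidate.split(), n), doc_freq, num_images)
        refs = refs_per_image[image_id]
        sims = [
            cosine(cand_vec, tfidf(ngrams(ref.split(), n), doc_freq, num_images))
            for ref in refs
        ]
        score += sum(sims) / len(sims)
    return score / max_n


# Made-up references for two images, for illustration only.
refs_per_image = {
    "img1": ["a dog runs on grass", "a dog is running"],
    "img2": ["a cat sleeps on a sofa", "a cat is sleeping"],
}
good = consensus_score("a dog runs", refs_per_image, "img1")
bad = consensus_score("a cat sleeps", refs_per_image, "img1")
```

Words that appear in the references of every image, such as "a" here, receive zero weight, so the score rewards the content words that annotators of one image agree on rather than words common to all descriptions.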

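This paper proposes a new approach to evaluating image descriptions based on human consensus, comprising a new triplet-based method of collecting human annotations, a new automated metric that captures consensus (CIDEr), and two new datasets (PASCAL-50S and ABSTRACT-50S) that contain 50 sentences describing each image. Five state-of-the-art image description approaches are evaluated using this new protocol, providing a benchmark for future comparisons.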