Our objective is open-world object counting in images, where the target
object class is specified by a text description. To this end, we propose
CounTX, a class-agnostic, single-stage model using a transformer decoder
counting head on top of pre-trained joint text-image representations. CounTX is
able to count the number of instances of any class given only an image and a
text description of the target object class, and can be trained end-to-end. To
the best of our knowledge, we are the first to tackle the open-world counting
problem in this way. In addition to this model, we make the following
contributions: (i) we compare the performance of CounTX to prior work on
open-world object counting, and show that our approach exceeds the state of the
art on all measures on the FSC-147 benchmark for methods that use text to
specify the task; (ii) we present and release FSC-147-D, an enhanced version of
FSC-147 with text descriptions, so that object classes can be described with
more detailed language than their simple class names. FSC-147-D is available at
this https URL

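We propose CounTX, a single-stage model based on a transformer decoder that can count target objects of any class and that exceeds the state of the art on the FSC-147 benchmark among methods that use text to specify the task.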

Open-world Text-specified Object Counting

Cross-modal pre-training has recently become a hot topic because of its
wide application in various downstream tasks, including retrieval,
captioning, and question answering. However, existing methods adopt a
one-stream pre-training model to learn a unified vision-language
representation for cross-modal retrieval, which easily suffers from
a computational explosion. Moreover, although conventional double-stream
structures are quite efficient, they still lack vital cross-modal
interactions, resulting in low performance. Motivated by these challenges, we
put forward Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE)
to learn joint text-image representations. Structurally, COOKIE adopts the
traditional double-stream structure because of its acceptable time consumption.
To overcome the inherent defects of the double-stream structure mentioned above,
we design two effective modules. Concretely, the first module is a
weight-sharing transformer built on top of the visual and textual
encoders, which aims to semantically align text and image. This design enables
the visual and textual paths to focus on the same semantics. The second module
consists of three specially designed forms of contrastive learning, which aim
to share knowledge between different models. The shared cross-modal knowledge
greatly improves unimodal representation learning, benefiting single-modal
retrieval tasks.
Extensive experiments on multi-modal matching tasks, including
cross-modal retrieval, text matching, and image retrieval, demonstrate the
superiority of our pre-training model in computational efficiency and
statistical indicators.

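This work proposes COOKIE, a contrastive cross-modal knowledge sharing pre-training method that adopts the traditional double-stream structure and combines it with two effective modules to learn joint text-image representations, aiming to improve the computational efficiency and statistical indicators of cross-modal retrieval.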
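
The COOKIE abstract mentions contrastive learning between the visual and textual streams but does not give the loss functions. As a rough illustration only, the sketch below shows a generic symmetric image-text contrastive loss (InfoNCE style) of the kind commonly used in double-stream pre-training. It is not the paper's three specific objectives. The function names, the temperature value of 0.07, and the assumption that row i of each batch forms a matched pair of non-zero embeddings are all hypothetical choices made for this example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a non-zero vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric contrastive loss over a batch of paired embeddings.

    Assumption for this sketch: image_embs[i] and text_embs[i] are a
    matched pair, and every other combination in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Transpose to obtain the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    aligned = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
    shuffled = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
    print(aligned < shuffled)
```

In this toy usage, the batch whose pairs are aligned yields a lower loss than the batch whose pairs are swapped, which is the behaviour a contrastive objective is meant to reward. A real pre-training setup would compute the embeddings with the two encoders and use a tensor library rather than plain lists.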