Existing fine-grained hashing methods typically lack code interpretability as
they compute hash code bits holistically using both global and local features.
To address this limitation, we propose ConceptHash, a novel method that
achieves sub-code level interpretability. In ConceptHash, each sub-code
corresponds to a human-understandable concept, such as an object part, and
these concepts are automatically discovered without human annotations.
Specifically, we leverage a Vision Transformer architecture and introduce
concept tokens as visual prompts, along with image patch tokens as model
inputs. Each concept is then mapped to a specific sub-code at the model output,
providing natural sub-code interpretability. To capture subtle visual
differences among highly similar sub-categories (e.g., bird species), we
incorporate language guidance to ensure that the learned hash codes are
distinguishable within fine-grained object classes while maintaining semantic
alignment. This approach allows us to develop hash codes that exhibit
similarity within families of species while remaining distinct from species in
other families. Extensive experiments on four fine-grained image retrieval
benchmarks demonstrate that ConceptHash outperforms previous methods by a
significant margin, offering unique sub-code interpretability as an additional
benefit. Code at: this https URL
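
To make the idea of sub-code level interpretability concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary hash code is a concatenation of equally sized sub-codes, one per discovered concept, and reports a Hamming distance per concept. The helper names `split_subcodes` and `per_concept_hamming`, the sub-code width, and the toy codes are all hypothetical.

```python
from typing import List


def split_subcodes(code: List[int], num_concepts: int) -> List[List[int]]:
    """Split a binary hash code into equally sized sub-codes, one per concept.

    Assumption: every concept owns a contiguous block of bits of the same width.
    """
    if num_concepts <= 0 or len(code) % num_concepts != 0:
        raise ValueError("code length must be a positive multiple of num_concepts")
    width = len(code) // num_concepts
    return [code[i * width:(i + 1) * width] for i in range(num_concepts)]


def per_concept_hamming(a: List[int], b: List[int], num_concepts: int) -> List[int]:
    """Return the Hamming distance between two codes, broken down by concept."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    subs_a = split_subcodes(a, num_concepts)
    subs_b = split_subcodes(b, num_concepts)
    return [sum(x != y for x, y in zip(sa, sb)) for sa, sb in zip(subs_a, subs_b)]


# Toy 12-bit codes with three hypothetical concepts of 4 bits each.
query = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
candidate = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]

distances = per_concept_hamming(query, candidate, num_concepts=3)
print(distances)       # [0, 2, 0]
print(sum(distances))  # 2, the ordinary Hamming distance over the full code
```

In this toy case the two codes differ only in the second sub-code, so a mismatch could be attributed to a single concept rather than to the code as a whole.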

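ConceptHash is a novel method that uses a Vision Transformer architecture and language guidance to make hash codes interpretable for fine-grained image retrieval, and it outperforms previous methods by a significant margin on four fine-grained image retrieval benchmarks.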

ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery

Optical Character Recognition is a technique that converts document images
into searchable and editable text, making it a valuable tool for processing
scanned documents. While the Farsi language stands as a prominent and official
language in Asia, efforts to develop efficient methods for recognizing Farsi
printed text have been relatively limited. This is primarily attributed to the
language's distinctive features, such as its cursive form, the resemblance
between certain alphabet characters, and the presence of numerous diacritics
and dot placements. Moreover, because deep learning-based architectures need
large numbers of training samples to perform well, the development of large
datasets is of paramount importance. In light of these
concerns, this paper presents a novel large-scale dataset, IDPL-PFOD2,
tailored for Farsi printed text recognition. The dataset comprises 2,003,541
images featuring a wide variety of fonts, styles, and sizes. This dataset is an
extension of the previously introduced IDPL-PFOD dataset, offering a
substantial increase in both volume and diversity. Furthermore, the dataset's
effectiveness is assessed using both CRNN-based and Vision
Transformer architectures. The CRNN-based model achieves a baseline accuracy
rate of 78.49% and a normalized edit distance of 97.72%, while the Vision
Transformer architecture attains an accuracy of 81.32% and a normalized edit
distance of 98.74%.
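
The two reported metrics can be illustrated with a short sketch. The abstract does not define them, so the code below rests on two assumptions: accuracy is the fraction of images whose predicted string matches the label exactly, and the normalized edit distance figure is a similarity score, that is, one minus the Levenshtein distance divided by the length of the longer string. The function names and the Latin placeholder strings are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(pred: str, target: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, target) / longest


def evaluate(preds: List[str], targets: List[str]) -> Tuple[float, float]:
    """Return (exact-match accuracy, mean normalized similarity)."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equally long")
    exact = sum(p == t for p, t in zip(preds, targets))
    sims = [normalized_similarity(p, t) for p, t in zip(preds, targets)]
    return exact / len(targets), sum(sims) / len(targets)


# Placeholder strings stand in for recognized Farsi text lines.
accuracy, mean_sim = evaluate(["kitten", "ocr"], ["sitting", "ocr"])
print(accuracy, round(mean_sim, 4))  # 0.5 0.7857
```

Under these assumptions a single wrong character makes the whole line count as an error for accuracy while barely lowering the character-level score, which would be consistent with accuracy figures that sit well below the normalized edit distance figures.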

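This paper introduces a new large-scale dataset for Farsi printed text recognition, comprising 2,003,541 images in a wide variety of fonts, styles, and sizes. The dataset's effectiveness is assessed with CRNN-based and Vision Transformer architectures: the CRNN-based model achieves a baseline accuracy of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.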

IDPL-PFOD2: A New Large-Scale Dataset for Printed Farsi Optical Character Recognition

Analyses of body posture are crucial for top-class athletes in many
sports disciplines. Coaches label, if at all, only the most important keypoints,
since manual annotations are very costly. This paper proposes a method to
detect arbitrary keypoints on the limbs and skis of professional ski jumpers
that requires only a few, partly correct segmentation masks during training.
Our model is based on the Vision Transformer architecture with a special design
for the input tokens to query for the desired keypoints. Since we use
segmentation masks only to generate ground truth labels for the freely
selectable keypoints, partly correct segmentation masks are sufficient for our
training procedure. Hence, there is no need for costly hand-annotated
segmentation masks. We analyze different training techniques for freely
selected and standard keypoints, including pseudo labels, and show in our
experiments that only a few partly correct segmentation masks are sufficient
for learning to detect arbitrary keypoints on limbs and skis.
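
One way a segmentation mask could yield a ground truth label for a freely selectable keypoint is sketched below. The abstract does not specify the procedure, so everything here is an assumption made for illustration: a keypoint is parametrized by a fraction `t` along the line between two standard keypoints and a signed relative offset `s` perpendicular to that line, scaled by the distance to the mask boundary. The names `boundary_distance` and `keypoint_from_mask` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def boundary_distance(mask: List[List[bool]], x: float, y: float,
                      dx: float, dy: float, step: float = 1.0,
                      max_steps: int = 10000) -> float:
    """Walk from (x, y) along the unit direction (dx, dy) and return how far
    we can go before leaving the mask or the image."""
    dist = 0.0
    for _ in range(max_steps):
        nx = x + (dist + step) * dx
        ny = y + (dist + step) * dy
        col = int(math.floor(nx + 0.5))  # nearest pixel column
        row = int(math.floor(ny + 0.5))  # nearest pixel row
        inside = 0 <= row < len(mask) and 0 <= col < len(mask[0])
        if not inside or not mask[row][col]:
            break
        dist += step
    return dist


def keypoint_from_mask(mask: List[List[bool]], start: Point, end: Point,
                       t: float, s: float) -> Point:
    """Assumed parametrization: t in [0, 1] along start->end, s in [-1, 1]
    across the limb, where |s| = 1 lies on the mask boundary."""
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        raise ValueError("start and end must differ")
    px, py = x0 + t * (x1 - x0), y0 + t * (y1 - y0)  # point on the limb axis
    nx, ny = -(y1 - y0) / length, (x1 - x0) / length  # unit normal to the axis
    sign = 1.0 if s >= 0 else -1.0
    reach = boundary_distance(mask, px, py, sign * nx, sign * ny)
    return px + s * reach * nx, py + s * reach * ny


# Toy mask: a horizontal "limb" covering rows 2..4 of a 7 x 11 image.
mask = [[2 <= r <= 4 for _ in range(11)] for r in range(7)]
on_edge = keypoint_from_mask(mask, (0, 3), (10, 3), t=0.5, s=1.0)
print(on_edge)  # (5.0, 4.0)
```

Because the label depends only on the mask near the selected point, a mask that is wrong elsewhere could still produce usable labels, which matches the abstract's claim that partly correct masks are sufficient. The sketch does not model how the Vision Transformer input tokens encode the query.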

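This paper presents a method, based on the Vision Transformer architecture and trained with partly correct segmentation masks, that detects arbitrary keypoints on professional ski jumpers. By analyzing different training techniques, the experiments show that only a few partly correct segmentation masks are sufficient for learning to detect arbitrary keypoints on limbs and skis, which avoids the high cost of hand-annotated segmentation masks.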