Image segmentation, the process of partitioning an image into meaningful
regions, plays a pivotal role in computer vision and medical imaging
applications. Unsupervised segmentation, particularly in the absence of labeled
data, remains a challenging task due to the inter-class similarity and
variations in intensity and resolution. In this study, we extract high-level
features of the input image using a pretrained vision transformer. Subsequently,
the proposed method leverages the underlying graph structures of the images,
seeking to discover and delineate meaningful boundaries using graph neural
networks and modularity-based optimization criteria without relying on
pre-labeled training data. Experimental results on benchmark datasets
demonstrate the effectiveness and versatility of the proposed approach,
showcasing competitive performance compared to the state-of-the-art
unsupervised segmentation methods. This research contributes to the broader
field of unsupervised medical imaging and computer vision by presenting an
innovative methodology for image segmentation that aligns with real-world
challenges. The proposed method holds promise for diverse applications,
including medical imaging, remote sensing, and object recognition, where
labeled data may be scarce or unavailable. The code is available in a GitHub
repository at [this https URL].
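
The modularity criterion mentioned above can be illustrated with a minimal sketch. It assumes a graph built by thresholding the cosine similarity between per-patch feature vectors and then scores a hard partition with the standard Newman modularity. The function names, the threshold value, and the toy features are hypothetical; the abstract does not specify the graph construction, and the graph neural network that would produce the cluster assignments is omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_adjacency(features, threshold=0.5):
    """Connect two patches when their similarity exceeds `threshold` (an assumption)."""
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(features[i], features[j]) > threshold:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def modularity(adj, labels):
    """Newman modularity Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * [c_i == c_j]."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Toy stand-ins for patch features: two patches per region.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adj = build_adjacency(features)
print(modularity(adj, [0, 0, 1, 1]))  # partition that follows the graph structure
print(modularity(adj, [0, 1, 0, 1]))  # partition that cuts across it
```

A partition that groups similar patches together scores higher than one that splits them, which is the property a modularity-based loss would reward during training.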

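The method extracts high-level features of the input image with a pretrained vision transformer, then uses graph neural networks and modularity-based optimization criteria to discover and delineate meaningful boundaries in the image without relying on pre-labeled training data. It achieves competitive performance and contributes to unsupervised medical imaging and computer vision.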

UnSegGNet: Unsupervised Graph Neural Network Image Segmentation

UnSegGNet: Unsupervised Image Segmentation using Graph Neural Networks

This work explores the performance of a large video understanding foundation
model on the downstream task of human fall detection on untrimmed video and
leverages a pretrained vision transformer for multi-class action detection,
with classes: "Fall", "Lying" and "Other/Activities of daily living (ADL)". A
method for temporal action localization that relies on a simple cutup of
untrimmed videos is demonstrated. The methodology includes a preprocessing
pipeline that converts datasets with timestamp action annotations into labeled
datasets of short action clips. Simple and effective clip-sampling strategies
are introduced. The effectiveness of the proposed method has been empirically
evaluated on the publicly available High-Quality Fall Simulation Dataset
(HQFSD). The experimental results validate the performance of the proposed
pipeline. The results are promising for real-time application: falls are
detected at the video level with a state-of-the-art F1 score of 0.96 on HQFSD
under the given experimental settings. The source code will be made
available on GitHub.
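
The preprocessing step described above, which turns timestamp annotations into labeled short clips, might look roughly like the following sketch. The clip length, the rule of labeling each clip by its largest annotation overlap, the `Annotation` type, and the abbreviated default label are all assumptions made for illustration; the abstract does not state the authors' actual settings or sampling strategies.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """A timestamped action annotation (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    label: str


def cut_into_clips(duration, annotations, clip_len=2.0, default_label="Other/ADL"):
    """Cut an untrimmed video into consecutive clips and label each one.

    Each clip receives the label of the annotation it overlaps most;
    clips with no overlap fall back to `default_label`.
    """
    clips = []
    index = 0
    while index * clip_len < duration:
        start = index * clip_len
        end = min(start + clip_len, duration)
        best_label, best_overlap = default_label, 0.0
        for ann in annotations:
            overlap = min(end, ann.end) - max(start, ann.start)
            if overlap > best_overlap:
                best_label, best_overlap = ann.label, overlap
        clips.append((start, end, best_label))
        index += 1
    return clips


# Hypothetical 7-second video with one fall followed by lying on the floor.
annotations = [Annotation(2.5, 4.0, "Fall"), Annotation(4.0, 7.0, "Lying")]
for clip in cut_into_clips(7.0, annotations):
    print(clip)
```

Fixed-length, non-overlapping clips keep the cutup simple; a sliding window with a stride shorter than the clip length would be a natural variation when finer temporal localization is needed.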

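Building on a large video understanding model, this study examines human fall detection in untrimmed video and uses a pretrained vision transformer for multi-class action detection with the classes "Fall", "Lying" and "Other/Activities of daily living". It presents a temporal action localization method based on a simple cutup of untrimmed videos and introduces simple, effective clip-sampling strategies. Experimental results validate the method: under the given experimental settings, falls are detected with an F1 score of 0.96, which is promising for real-time application. The source code will be made available on GitHub.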

Cut and Detect: Human Fall Detection on Cut-Up Untrimmed Videos Using a Large Foundational Video Understanding Model

Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model

Towards flexible object-centric visual perception, we propose a one-shot
instance-aware object keypoint (OKP) extraction approach, AnyOKP, which
leverages the powerful representation ability of a pretrained vision transformer
(ViT), and can obtain keypoints on multiple object instances of arbitrary
categories after learning from a support image. An off-the-shelf pretrained ViT is
directly deployed for generalizable and transferable feature extraction, which
is followed by training-free feature enhancement. The best-prototype pairs
(BPPs) are searched for in support and query images based on appearance
similarity, to yield instance-unaware candidate keypoints. Then, the entire
graph with all candidate keypoints as vertices is divided into sub-graphs
according to the feature distributions on the graph edges. Finally, each
sub-graph represents an object instance. AnyOKP is evaluated on real object
images collected with the cameras of a robot arm, a mobile robot, and a
surgical robot. The results demonstrate not only cross-category flexibility and
instance awareness, but also remarkable robustness to domain shift and
viewpoint change.
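
The graph-splitting step can be pictured with the minimal sketch below, which assumes that each edge between two candidate keypoints carries a single same-instance score and that edges below a threshold are cut. The scoring, the threshold, and the function name are hypothetical; the abstract only states that the division follows the feature distributions on the graph edges, so this is one simple stand-in rather than the authors' procedure.

```python
def split_into_subgraphs(num_vertices, edge_scores, keep_threshold=0.5):
    """Group vertices into sub-graphs by keeping only high-scoring edges.

    `edge_scores` maps a vertex pair (u, v) to a hypothetical same-instance
    score. Edges scoring below `keep_threshold` are cut, and the connected
    components of what remains are returned as sorted vertex lists.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), score in edge_scores.items():
        if score >= keep_threshold:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_v] = root_u

    groups = {}
    for vertex in range(num_vertices):
        groups.setdefault(find(vertex), []).append(vertex)
    return sorted(groups.values())


# Five candidate keypoints; low scores mark edges that cross instances.
edge_scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7, (0, 4): 0.2}
print(split_into_subgraphs(5, edge_scores))
```

Each returned group then plays the role of one object instance, with its vertices being the keypoints assigned to that instance.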

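For flexible object-centric visual perception, we propose AnyOKP, a one-shot instance-aware object keypoint extraction method that leverages the strong representation ability of a pretrained vision transformer (ViT) and can obtain keypoints on multiple object instances of arbitrary categories after learning from a support image. An off-the-shelf pretrained ViT is deployed directly for generalizable and transferable feature extraction, followed by training-free feature enhancement. Best-prototype pairs (BPPs) are searched for in the support and query images based on appearance similarity to yield instance-unaware candidate keypoints. The entire graph containing all candidate keypoints is then divided into sub-graphs according to the feature distributions on the graph edges. Finally, each sub-graph represents an object instance. AnyOKP is evaluated on real object images collected with the cameras of a robot arm, a mobile robot, and a surgical robot, demonstrating not only cross-category flexibility and instance awareness but also remarkable robustness to domain shift and viewpoint change.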