Multi-view segmentation in Remote Sensing (RS) seeks to segment images from
diverse perspectives within a scene. Recent methods leverage 3D information
extracted from an Implicit Neural Field (INF), bolstering result consistency
across multiple views while requiring only a limited number of labels (as few as
3-5) to reduce annotation effort. Nonetheless, achieving superior performance within
the constraints of limited-view labels remains challenging due to inadequate
scene-wide supervision and insufficient semantic features within the INF. To
address these issues, we propose to inject the prior of the visual foundation
model Segment Anything (SAM) into the INF to obtain better results with a
limited amount of training data. Specifically, we contrast SAM features between
testing and training views to derive pseudo labels for each testing view,
augmenting scene-wide labeling information. Subsequently, we introduce SAM
features via a transformer into the INF of the scene, supplementing the
semantic information. The experimental results demonstrate that our method
outperforms the mainstream method, confirming the efficacy of SAM as a
supplement to the INF for this task.
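
The pseudo-labeling step can be illustrated with a minimal sketch. It assumes that "contrasting" features means a cosine-similarity nearest-neighbor lookup with a confidence threshold; the abstract does not specify the actual procedure, and the function names, the `min_sim` threshold, and the toy 2-D vectors standing in for SAM features are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pseudo_labels(test_feats, train_feats, train_labels, min_sim=0.8, ignore=-1):
    """Give each test feature the label of its most similar training feature.

    Features whose best similarity is below `min_sim` (an assumed threshold)
    receive `ignore`, so uncertain pixels contribute no supervision.
    """
    out = []
    for f in test_feats:
        sims = [cosine(f, g) for g in train_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        out.append(train_labels[best] if sims[best] >= min_sim else ignore)
    return out

# Toy 2-D vectors standing in for per-pixel SAM embeddings (hypothetical values).
train_feats = [[1.0, 0.0], [0.0, 1.0]]
train_labels = [0, 1]  # e.g. 0 = building, 1 = road
test_feats = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
labels = pseudo_labels(test_feats, train_feats, train_labels)
print(labels)  # [0, 1, -1]
```

The third test feature is equally similar to both classes and falls below the threshold, so it is left unlabeled rather than guessed.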

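By injecting the visual foundation model Segment Anything (SAM) into an Implicit Neural Field (INF), we propose a new multi-view remote sensing image segmentation method. Contrasting SAM features between testing and training views yields pseudo labels for each testing view, augmenting the scene-wide labeling information. Experiments show that our method outperforms mainstream methods with limited training data, confirming the efficacy of SAM as a supplement to the INF for this task.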

Multi-View Remote Sensing Image Segmentation Based on SAM Priors

Multi-view Remote Sensing Image Segmentation With SAM priors

Mamba, a recent selective structured state space model, performs excellently
on long sequence modeling tasks. Mamba mitigates the modeling constraints of
convolutional neural networks and offers advanced modeling capabilities similar
to those of Transformers, through global receptive fields and dynamic
weighting. Crucially, it achieves this without incurring the quadratic
computational complexity typically associated with Transformers. Owing to its
advantages over these two mainstream families of foundation models, Mamba exhibits
great potential to be a visual foundation model. Researchers are actively
applying Mamba to various computer vision tasks, leading to numerous emerging
works. To help keep pace with the rapid advancements in computer vision, this
paper aims to provide a comprehensive review of visual Mamba approaches. This
paper begins by delineating the formulation of the original Mamba model.
Subsequently, our review of visual Mamba delves into several representative
backbone networks to elucidate the core insights of the visual Mamba. We then
categorize related works using different modalities, including image, video,
point cloud, multi-modal, and others. Specifically, for image applications, we
further organize them into distinct tasks to facilitate a more structured
discussion. Finally, we discuss the challenges and future research directions
for visual Mamba, providing insights for future research in this quickly
evolving area. A comprehensive list of visual Mamba models reviewed in this
work is available at this https URL
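
As a rough illustration of why a selective state space model runs in time linear in sequence length, the sketch below implements a single-channel recurrence h_t = A_bar * h_(t-1) + B_bar * x_t, y_t = C_t . h_t with a diagonal state matrix and a zero-order-hold discretization. The step size, B and C are made input-dependent through simple scalar projections; those projections and the names `w_delta`, `w_b`, `w_c` are simplifying assumptions for this example, not the parameterization used in any particular Mamba implementation.

```python
import math

def softplus(z):
    """Smooth positive function used to keep the step size above zero."""
    return math.log1p(math.exp(z))

def selective_scan(xs, a, w_delta, w_b, w_c):
    """Run a 1-channel selective SSM over `xs` in O(len(xs) * len(a)) time.

    a: negative diagonal entries of the state matrix A, one per state dimension.
    w_delta, w_b, w_c: hypothetical projection weights that make the step size,
    B and C depend on the current input (the "selective" part).
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)       # input-dependent step size > 0
        b = [wb * x for wb in w_b]          # input-dependent B_t
        c = [wc * x for wc in w_c]          # input-dependent C_t
        for n, a_n in enumerate(a):
            a_bar = math.exp(delta * a_n)   # zero-order-hold discretization of A
            b_bar = (a_bar - 1.0) / a_n * b[n]
            h[n] = a_bar * h[n] + b_bar * x
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
    return ys

params = dict(a=[-1.0, -0.5], w_delta=1.0, w_b=[1.0, 0.5], w_c=[0.3, -0.2])
xs = [1.0, 0.5, -0.3, 2.0]
ys = selective_scan(xs, **params)
print(len(ys))  # one output per input token
```

Each token updates a fixed-size hidden state once, so cost grows linearly with sequence length, in contrast to the pairwise token comparisons of self-attention.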

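In this survey, we review the origins and core insights of the Mamba model and how it is applied to different computer vision tasks. We categorize and organize applications across images, videos, point clouds, multi-modal data, and others, and discuss challenges and research directions for this rapidly evolving field.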

Vision Mamba: A Survey of Models, Applications and Challenges

A Survey on Vision Mamba: Models, Applications and Challenges

Leading approaches in machine vision employ different architectures for
different tasks, trained on costly task-specific labeled datasets. This
complexity has held back progress in areas such as robotics, where robust
task-general perception remains a bottleneck. In contrast, "foundation models"
of natural language have shown how large pre-trained neural networks can
provide zero-shot solutions to a broad spectrum of apparently distinct tasks.
Here we introduce Counterfactual World Modeling (CWM), a framework for
constructing a visual foundation model: a unified, unsupervised network that
can be prompted to perform a wide variety of visual computations. CWM has two
key components, which resolve the core issues that have hindered application of
the foundation model concept to vision. The first is structured masking, a
generalization of masked prediction methods that encourages a prediction model
to capture the low-dimensional structure in visual data. The model thereby
factors the key physical components of a scene and exposes an interface to them
via small sets of visual tokens. This in turn enables CWM's second main idea --
counterfactual prompting -- the observation that many apparently distinct
visual representations can be computed, in a zero-shot manner, by comparing the
prediction model's output on real inputs versus slightly modified
("counterfactual") inputs. We show that CWM generates high-quality readouts on
real-world images and videos for a diversity of tasks, including estimation of
keypoints, optical flow, occlusions, object segments, and relative depth. Taken
together, our results show that CWM is a promising path to unifying the
manifold strands of machine vision in a conceptually simple foundation.
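
The idea of counterfactual prompting can be shown with a toy example. The sketch below estimates the motion of one pixel by perturbing it, running a predictor on both the real and the perturbed input, and locating where the two predictions differ. The 1-D "frames", the hand-written `toy_predictor` (a fixed shift standing in for a learned prediction model), and all function names are assumptions for illustration; the masking scheme and network described in the abstract are not modeled.

```python
def counterfactual_flow(predict_next, frame, i, bump=1.0):
    """Estimate where pixel `i` moves by comparing predictions on real vs. perturbed input."""
    perturbed = list(frame)
    perturbed[i] += bump                      # the "counterfactual" input
    real_out = predict_next(frame)
    cf_out = predict_next(perturbed)
    diff = [abs(p - q) for p, q in zip(cf_out, real_out)]
    dest = max(range(len(diff)), key=diff.__getitem__)
    return dest - i                           # displacement of the perturbation

def toy_predictor(frame, shift=2):
    """Stand-in for a learned predictor: the scene moves `shift` pixels to the right."""
    n = len(frame)
    return [frame[(k - shift) % n] for k in range(n)]

frame = [0.0, 0.2, 0.9, 0.4, 0.0, 0.0, 0.0, 0.0]
flow = counterfactual_flow(toy_predictor, frame, i=2)
print(flow)  # 2
```

No flow labels are used: the displacement is read out purely from how the predictor responds to a small change in its input, which is the zero-shot aspect the abstract describes.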

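We introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model: a unified, unsupervised network that can be prompted to perform a wide variety of visual computations. The results show that CWM is a promising path to unifying the many strands of machine vision.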