People with blindness and low vision (pBLV) encounter substantial challenges
when it comes to comprehensive scene recognition and precise object
identification in unfamiliar environments. Because of their vision loss, pBLV
also have difficulty assessing and identifying potential tripping hazards on
their own. In this paper, we present an approach that leverages a large
vision-language model to enhance visual perception for pBLV, offering detailed
and comprehensive descriptions of the surrounding environment and providing
warnings about potential risks. Our method
begins by leveraging a large image tagging model (i.e., Recognize Anything
(RAM)) to identify all common objects present in the captured images. The
recognition results and user query are then integrated into a prompt, tailored
specifically for pBLV using prompt engineering. Given the prompt and the
input image, a large vision-language model (i.e., InstructBLIP) generates
detailed and comprehensive descriptions of the environment and identifies
potential risks by analyzing the objects and scene relevant to the prompt.
We evaluate our approach through experiments
conducted on both indoor and outdoor datasets. Our results demonstrate that our
method is able to recognize objects accurately and provide insightful
descriptions and analysis of the environment for pBLV.
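
The prompt-construction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_prompt` and the prompt wording are hypothetical, the tag list stands in for the output of the image tagging model, and the call to the vision-language model is omitted.

```python
def build_prompt(tags, user_query):
    """Combine recognized object tags and the user's question into one prompt.

    `tags` is assumed to be a list of strings from an image tagging model;
    the sentence template below is a hypothetical example, not the paper's.
    """
    # Normalize and deduplicate tags while preserving recognition order.
    seen = set()
    unique_tags = []
    for tag in tags:
        key = tag.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique_tags.append(key)

    tag_text = ", ".join(unique_tags) if unique_tags else "no objects recognized"
    return (
        f"The image may contain the following objects: {tag_text}. "
        "I am a person with blindness or low vision. "
        f"{user_query.strip()}"
    )


if __name__ == "__main__":
    # Hard-coded tags stand in for real recognition results.
    print(build_prompt(["Chair", "table", "chair "], "Are there any tripping hazards?"))
```

The resulting string would be passed to the vision-language model together with the input image.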

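This paper proposes an approach that uses a large vision-language model to enhance visual perception for people with blindness and low vision, providing detailed, comprehensive descriptions of the surrounding environment and warnings about potential risks. The method integrates image recognition results and the user query into a prompt, from which a large vision-language model generates a detailed description of the environment and identifies potential risks by analyzing the objects and scene. Experimental results show that the method recognizes objects accurately and provides in-depth descriptions and analysis of the environment for people with blindness and low vision.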

VisPercep: A Vision-Language Approach to Enhance Visual Perception for People with Blindness and Low Vision

Common fully glazed facades and transparent objects present architectural
barriers and impede the mobility of people with low vision or blindness; for
instance, a path detected behind a glass door is inaccessible unless the door
is correctly perceived and reacted to. However, segmenting these safety-critical
objects is rarely covered by conventional assistive technologies. To tackle
this issue, we construct a wearable system with a novel dual-head Transformer
for Transparency (Trans4Trans) model, which is capable of segmenting general
and transparent objects and performing real-time wayfinding to assist people
walking alone more safely. In particular, both decoders, built from our proposed
Transformer Parsing Module (TPM), enable effective joint learning from different
datasets. Moreover, the efficient Trans4Trans model, composed of a symmetric
transformer-based encoder and decoder, incurs little computational expense
and is readily deployed on portable GPUs. Our Trans4Trans model outperforms
state-of-the-art methods on the test sets of the Stanford2D3D and Trans10K-v2
datasets, obtaining mIoU of 45.13% and 75.14%, respectively. Through various
pre-tests and a user study conducted in indoor and outdoor scenarios, the
usability and reliability of our assistive system have been extensively
verified.
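
For readers unfamiliar with the reported metric, mean intersection-over-union (mIoU) can be sketched as below. This is an illustrative sketch under stated assumptions, not the evaluation code used for the reported numbers: the function name `mean_iou` is hypothetical, labels are given as flattened lists of class indices, and classes absent from both prediction and ground truth are skipped. Benchmark protocols may differ, for example in how ignore labels are handled or in accumulating counts over the whole test set.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    `pred` and `target` are equal-length sequences of integer class labels
    (one per pixel). Skipping absent classes is an assumption of this sketch.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel is correct: it lies in both the intersection and the union.
            intersection[p] += 1
            union[p] += 1
        else:
            # Pixel is wrong: it enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1

    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```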

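We propose the Trans4Trans model, which uses a dual-head Transformer to segment common fully glazed facades and transparent objects and to perform real-time assisted navigation, improving the mobility of people with low vision. The model is built on a symmetric transformer-based encoder and decoder, has low computational cost, and can be readily deployed on portable GPUs. It outperforms state-of-the-art methods on the Stanford2D3D and Trans10K-v2 datasets, obtaining mIoU of 45.13% and 75.14%, respectively. The usability and reliability of the assistive system are verified through various pre-tests and a user study.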