Vision-and-Language Navigation (VLN) stands as a key research problem of
Embodied AI, aiming at enabling agents to navigate in unseen environments
following linguistic instructions. In this field, generalization is a
long-standing challenge, either to out-of-distribution scenes or from Sim to
Real. In this paper, we propose NaVid, a video-based large vision language
model (VLM), to mitigate such a generalization gap. NaVid makes the first
endeavour to showcase the capability of VLMs to achieve state-of-the-art level
navigation performance without any maps, odometers, or depth inputs. Following
human instruction, NaVid only requires an on-the-fly video stream from a
monocular RGB camera equipped on the robot to output the next-step action. Our
formulation mimics how humans navigate and naturally gets rid of the problems
introduced by odometer noise and by the Sim2Real gaps from map or depth inputs.
Moreover, our video-based approach can effectively encode the historical
observations of robots as spatio-temporal contexts for decision-making and
instruction following. We train NaVid with 550k navigation samples collected
from VLN-CE trajectories, including action-planning and instruction-reasoning
samples, along with 665k samples of large-scale web data. Extensive experiments show that
NaVid achieves SOTA performance in simulation environments and the real world,
demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our
proposed VLM approach plans the next step for not only the navigation agents
but also this research field.

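NaVid is a video-based large vision language model that achieves state-of-the-art level navigation performance from an on-the-fly video stream, without maps, odometers, or depth inputs. It avoids the problems introduced by odometer noise and the Sim2Real gap, and it effectively encodes the robot's historical observations as spatio-temporal context for decision-making and instruction following. Trained on 550k navigation samples and 665k web data samples, it performs strongly in both simulation environments and the real world, planning the next step for navigation agents and for the research field as a whole.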

Video-based VLM Plans the Next Step for Vision-and-Language Navigation

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Wearable sensors such as Inertial Measurement Units (IMUs) are often used to
assess the performance of human exercise. Common approaches use handcrafted
features based on domain expertise or automatically extracted features using
time series analysis. Multiple sensors are required to achieve high
classification accuracy, which is not very practical. These sensors require
calibration and synchronization and may lead to discomfort over longer time
periods. Recent work utilizing computer vision techniques has shown similar
performance using video, without the need for manual feature engineering, and
avoiding some pitfalls such as sensor calibration and placement on the body. In
this paper, we compare the performance of IMUs to a video-based approach for
human exercise classification on two real-world datasets consisting of Military
Press and Rowing exercises. We compare the performance using a single camera
that captures video in the frontal view versus using 5 IMUs placed on different
parts of the body. We observe that an approach based on a single camera can
outperform a single IMU by 10 percentage points on average. Additionally, a
minimum of 3 IMUs are required to outperform a single camera. We observe that
working with the raw data using multivariate time series classifiers
outperforms traditional approaches based on handcrafted or automatically
extracted features. Finally, we show that an ensemble model combining the data
from a single camera with a single IMU outperforms either data modality. Our
work opens up new and more realistic avenues for this application, where a
video captured using a readily available smartphone camera, combined with a
single sensor, can be used for effective human exercise classification.

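This paper compares approaches based on Inertial Measurement Units (IMUs) with a video-based approach for human exercise classification on Military Press and Rowing exercises. It finds that a single camera can outperform a single IMU by 10 percentage points on average, and that at least 3 IMUs are needed to outperform a single camera. In addition, working on the raw data with multivariate time series classifiers outperforms traditional approaches based on handcrafted or automatically extracted features. Finally, combining the data from a single camera and a single IMU outperforms either data modality, opening up new and more realistic avenues for effective human exercise classification with a smartphone camera and a single sensor.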

A Study of Wearable Sensors and Video Data Capture for Human Exercise Classification

An Examination of Wearable Sensors and Video Data Capture for Human Exercise Classification
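
The abstract above reports that an ensemble of one camera and one IMU outperforms either modality, but it does not say how the two are combined. As a minimal sketch, assuming simple late fusion (a weighted average of per-class probabilities from two separately trained classifiers), the idea could look like the following. The function names, the `video_weight` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_probabilities(
    video_probs: Sequence[float],
    imu_probs: Sequence[float],
    video_weight: float = 0.5,
) -> List[float]:
    """Weighted average of two per-class probability vectors (late fusion).

    Assumption: both classifiers score the same classes in the same order.
    """
    if len(video_probs) != len(imu_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    imu_weight = 1.0 - video_weight
    return [video_weight * v + imu_weight * i
            for v, i in zip(video_probs, imu_probs)]


def predict_class(probs: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(probs)), key=lambda k: probs[k])


# Hypothetical scores for one repetition over three execution classes.
video_probs = [0.50, 0.30, 0.20]  # the video model alone picks class 0
imu_probs = [0.10, 0.70, 0.20]    # the IMU model alone picks class 1

fused = fuse_probabilities(video_probs, imu_probs)
print(predict_class(fused))  # prints 1
```

With equal weights the fused prediction follows whichever model is more confident. In practice `video_weight` would be tuned on held-out data rather than fixed at 0.5.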

In this paper, we are the first to tackle the problem of pedestrian attribute
recognition with a video-based approach. The challenge mainly lies in spatial and
temporal modeling and how to integrate them for effective and dynamic
pedestrian representation. To solve this problem, a novel multi-task model
based on a convolutional neural network and a temporal attention strategy is
proposed. Since publicly available datasets are rare, two new large-scale video
datasets with expanded attribute definitions are presented, on which the
effectiveness of both video-based pedestrian attribute recognition methods and
the proposed new network architecture is well demonstrated. The two datasets
are published on this http URL

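This study proposes a new network architecture that combines a video-based multi-task model with a temporal attention strategy to address the challenges of pedestrian attribute recognition. It also publicly releases two new large-scale video datasets, which are used to demonstrate the effectiveness of the method.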
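
The abstract names a temporal attention strategy without describing its form. As a rough illustration only, one common reading is attention pooling over frames: each frame's feature vector gets a relevance score, the scores are normalized with a softmax, and the clip-level feature is the weighted sum. The sketch below assumes that reading. The dot-product scoring and the names `temporal_attention_pool` and `scoring_vector` are hypothetical, not the paper's architecture.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(
    frame_features: Sequence[Sequence[float]],
    scoring_vector: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool per-frame feature vectors into one clip-level vector.

    Each frame is scored by a dot product with `scoring_vector`
    (a stand-in for learned attention parameters). The scores become
    softmax weights over time, and the output is the weighted sum.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    scores = [sum(f * w for f, w in zip(frame, scoring_vector))
              for frame in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    pooled = [sum(weights[t] * frame_features[t][d]
                  for t in range(len(frame_features)))
              for d in range(dim)]
    return pooled, weights


# Two hypothetical 2-dimensional frame features.
frames = [[1.0, 0.0], [3.0, 2.0]]

# A zero scoring vector gives uniform weights, so pooling is the frame mean.
mean_pooled, uniform_weights = temporal_attention_pool(frames, [0.0, 0.0])

# A non-zero scoring vector shifts weight toward higher-scoring frames.
focused_pooled, focused_weights = temporal_attention_pool(frames, [1.0, 0.0])
```

In a multi-task setting, one such scoring vector per attribute would let each attribute attend to different frames. That is a design assumption here, not something the abstract states.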