To make progress towards multi-modal AI assistants which can guide users to
achieve complex multi-step goals, we propose the task of Visual Planning for
Assistance (VPA). Given a goal briefly described in natural language, e.g.,
"make a shelf", and a video of the user's progress so far, the aim of VPA is to
obtain a plan, i.e., a sequence of actions such as "sand shelf", "paint shelf",
etc., to achieve the goal. This requires assessing the user's progress from the
untrimmed video, and relating it to the requirements of the underlying goal, i.e.,
relevance of actions and ordering dependencies amongst them. Consequently, this
requires handling long video history, and arbitrarily complex action
dependencies. To address these challenges, we decompose VPA into video action
segmentation and forecasting. We formulate the forecasting step as a
multi-modal sequence modeling problem and present Visual Language Model based
Planner (VLaMP), which leverages pre-trained LMs as the sequence model. We
demonstrate that VLaMP performs significantly better than baselines w.r.t. all
metrics that evaluate the generated plan. Moreover, through extensive
ablations, we also isolate the contributions of language pre-training, visual
observations, and goal information to performance. We will release our
data, model, and code to enable future research on visual planning for
assistance.
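
The decomposition described above (segment the video into actions, then forecast the remaining steps) can be illustrated with a toy sketch. This is not VLaMP: the forecasting model here is a goal-conditioned bigram counter swapped in as a stand-in for the pre-trained LM, the per-frame labels are assumed to come from some upstream segmentation model, and the function names and the actions "cut wood" and "attach shelf" are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby


def action_history(frame_labels, background="background"):
    """Collapse per-frame labels (the segmentation output) into an ordered action history."""
    return [label for label, _ in groupby(frame_labels) if label != background]


def fit_next_action_counts(plans_by_goal):
    """Count goal-conditioned next-action frequencies (toy stand-in for the pre-trained LM)."""
    counts = defaultdict(Counter)
    for goal, plans in plans_by_goal.items():
        for plan in plans:
            for prev, nxt in zip(plan, plan[1:]):
                counts[(goal, prev)][nxt] += 1
    return counts


def forecast(goal, history, counts, horizon=2):
    """Greedily roll out up to `horizon` future actions after the last observed action."""
    if not history:
        return []
    plan, current = [], history[-1]
    for _ in range(horizon):
        followers = counts.get((goal, current))
        if not followers:
            break
        current = followers.most_common(1)[0][0]
        plan.append(current)
    return plan


# Hypothetical data: "cut wood" and "attach shelf" are invented for illustration.
demos = {"make a shelf": [["cut wood", "sand shelf", "paint shelf", "attach shelf"]]}
frames = ["background", "cut wood", "cut wood", "background", "sand shelf", "sand shelf"]

counts = fit_next_action_counts(demos)
history = action_history(frames)          # ['cut wood', 'sand shelf']
plan = forecast("make a shelf", history, counts)
print(history, plan)
```

Keeping the history as a short list of actions rather than raw frames is what makes a long video tractable for a sequence model; the bigram counter only looks one step back, which is exactly the limitation a pre-trained LM is meant to lift.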

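This work proposes the task of Visual Planning for Assistance (VPA). By decomposing it into video action segmentation and forecasting, and by using pre-trained language models to handle long video histories and complex action dependencies, it aims to enable multi-modal AI assistants to guide users toward complex multi-step goals.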

Pretrained Language Models as Visual Planners for Human Assistance

Video action segmentation under timestamp supervision has recently received
much attention due to lower annotation costs. Most existing methods generate
pseudo-labels for all frames in each video to train the segmentation model.
However, these methods suffer from incorrect pseudo-labels, especially for the
semantically unclear frames in the transition region between two consecutive
actions, which we call ambiguous intervals. To address this issue, we propose a
novel framework from the perspective of clustering, which includes the
following two parts. First, pseudo-label ensembling generates incomplete but
high-quality pseudo-label sequences, where the frames in ambiguous intervals
have no pseudo-labels. Second, iterative clustering propagates the
pseudo-labels to the ambiguous intervals by clustering, and thus updates the
pseudo-label sequences to train the model. We further introduce a clustering
loss, which encourages the features of frames within the same action segment
to be more compact. Extensive experiments show the effectiveness of our method.
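
The two-stage idea above can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: the initial labeling uses a fixed `radius` around each timestamp as a hypothetical stand-in for pseudo-label ensembling, the propagation rule (absorb whichever boundary frame lies closest to its adjacent segment centroid) is an assumption, and the loss is a generic within-segment compactness penalty whose exact form the abstract does not give.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def segment_centroids(features, seg_ids):
    """Mean feature vector of every segment that currently has labeled frames."""
    members = {}
    for f, k in zip(features, seg_ids):
        if k is not None:
            members.setdefault(k, []).append(f)
    return {k: [sum(col) / len(v) for col in zip(*v)] for k, v in members.items()}


def initial_segments(num_frames, timestamps, radius=0):
    """Label only frames within `radius` of a timestamp; ambiguous frames stay None."""
    seg_ids = [None] * num_frames
    for k, (t, _action) in enumerate(timestamps):
        for i in range(max(0, t - radius), min(num_frames, t + radius + 1)):
            seg_ids[i] = k
    return seg_ids


def propagate(features, seg_ids):
    """Label one boundary frame per iteration, recomputing centroids each time."""
    seg_ids = list(seg_ids)
    if all(k is None for k in seg_ids):
        raise ValueError("need at least one labeled frame")
    while None in seg_ids:
        centroids = segment_centroids(features, seg_ids)
        i = seg_ids.index(None)
        j = i
        while j + 1 < len(seg_ids) and seg_ids[j + 1] is None:
            j += 1  # frames i..j form the first ambiguous interval
        left = seg_ids[i - 1] if i > 0 else None
        right = seg_ids[j + 1] if j + 1 < len(seg_ids) else None
        if left is None:
            seg_ids[j] = right
        elif right is None:
            seg_ids[i] = left
        elif dist2(features[i], centroids[left]) <= dist2(features[j], centroids[right]):
            seg_ids[i] = left
        else:
            seg_ids[j] = right
    return seg_ids


def clustering_loss(features, seg_ids):
    """Mean squared distance of each labeled frame to its segment centroid."""
    centroids = segment_centroids(features, seg_ids)
    labeled = [(f, k) for f, k in zip(features, seg_ids) if k is not None]
    return sum(dist2(f, centroids[k]) for f, k in labeled) / len(labeled)


# Toy 1-D features and two hypothetical timestamp annotations.
features = [[0], [1], [2], [10], [11], [13]]
timestamps = [(0, "A"), (5, "B")]
initial = initial_segments(len(features), timestamps)
final = propagate(features, initial)
labels = [timestamps[k][1] for k in final]
print(initial, final, labels)
```

Segments are tracked by index rather than by action name so that an action occurring twice in a video is not merged into one cluster.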

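This paper proposes a framework, from the perspective of clustering, to address the incorrect pseudo-labels caused by ambiguous intervals in timestamp-supervised video action segmentation. It also introduces a clustering loss that makes the features of frames within the same action segment more compact. Experiments show that the method is effective.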

Timestamp-Supervised Action Segmentation from the Perspective of Clustering

This paper introduces a unified framework for video action segmentation via
sequence-to-sequence (seq2seq) translation in both fully supervised and
timestamp-supervised setups. In contrast to current state-of-the-art frame-level prediction methods,
we view action segmentation as a seq2seq translation task, i.e., mapping a
sequence of video frames to a sequence of action segments. Our proposed method
involves a series of modifications and auxiliary loss functions on the standard
Transformer seq2seq translation model to cope with long input sequences as opposed
to short output sequences and relatively few videos. We incorporate an
auxiliary supervision signal for the encoder via a frame-wise loss and propose
a separate alignment decoder for an implicit duration prediction. Finally, we
extend our framework to the timestamp supervised setting via our proposed
constrained k-medoids algorithm to generate pseudo-segmentations. Our proposed
framework performs consistently in both the fully supervised and timestamp-supervised
settings, outperforming or competing with the state of the art on several datasets. Our
code is publicly available at this https URL
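
To make the timestamp-supervised step concrete, here is a minimal sketch of one plausible reading of a constrained k-medoids procedure. The abstract names the algorithm but does not specify it, so the constraints used here (segments are contiguous in time and segment k must contain the k-th annotated timestamp), the alternating update, and all function names are assumptions. The output is a sequence of (action, duration) segments, i.e., the kind of short target sequence a seq2seq model would be trained to emit.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_k_medoids(features, timestamps, max_iters=10):
    """Return boundaries [0, b_1, ..., n]; segment k covers frames b_k .. b_{k+1}-1.

    Assumes `timestamps` is a non-empty list of (frame_index, action) pairs
    sorted by strictly increasing frame index.
    """
    n, times = len(features), [t for t, _ in timestamps]
    medoids, bounds = list(times), None  # start with medoids at the annotated frames
    for _ in range(max_iters):
        # Assignment step: pick each boundary between two consecutive timestamps.
        new_bounds = [0]
        for k in range(len(times) - 1):
            lo, hi = times[k] + 1, times[k + 1]
            left, right = features[medoids[k]], features[medoids[k + 1]]

            def cost(b):
                return (sum(dist2(features[i], left) for i in range(lo, b))
                        + sum(dist2(features[i], right) for i in range(b, hi)))

            new_bounds.append(min(range(lo, hi + 1), key=cost))
        new_bounds.append(n)
        if new_bounds == bounds:
            break
        bounds = new_bounds
        # Update step: the medoid is the frame minimizing distance to its segment.
        medoids = [
            min(range(s, e),
                key=lambda m: sum(dist2(features[i], features[m]) for i in range(s, e)))
            for s, e in zip(bounds, bounds[1:])
        ]
    return bounds


def to_segments(bounds, timestamps):
    """Convert boundaries into the (action, duration) target sequence."""
    return [(action, e - s) for (_, action), s, e in zip(timestamps, bounds, bounds[1:])]


# Toy 1-D features and two hypothetical timestamp annotations.
features = [[0], [1], [2], [10], [11], [13]]
timestamps = [(1, "A"), (4, "B")]
bounds = constrained_k_medoids(features, timestamps)
segments = to_segments(bounds, timestamps)
print(bounds, segments)
```

Restricting each boundary to lie between two consecutive timestamps is what keeps every pseudo-segment consistent with its annotation; an unconstrained k-medoids could assign frames to clusters out of temporal order.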

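This paper proposes a unified framework for video action segmentation based on sequence-to-sequence (seq2seq) translation, covering both the fully supervised and the timestamp-supervised settings. Action segmentation is treated as mapping a sequence of video frames to a sequence of action segments. We propose a series of modifications and auxiliary loss functions for the standard Transformer seq2seq translation model to cope with long input sequences, short output sequences, and relatively few videos. We add an auxiliary supervision signal for the encoder, propose a separate alignment decoder for implicit duration prediction, and extend the framework to the timestamp-supervised setting via our constrained k-medoids algorithm, which generates pseudo-segmentations. The framework performs consistently in both settings, outperforming or competing with the state of the art on several datasets.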