Despite constituting 65% of all internet traffic in 2023, video content is
underrepresented in generative AI research. Meanwhile, recent large language
models (LLMs) have become increasingly integrated with capabilities in the
visual modality. Integrating video with LLMs is a natural next step, so how can
this gap be bridged? To advance video reasoning, we propose a new research
direction of VideoCOT on video keyframes, which leverages the multimodal
generative abilities of vision-language models to enhance video reasoning while
reducing the computational complexity of processing hundreds or thousands of
frames. We introduce VIP, an inference-time dataset that can be used to
evaluate VideoCOT, containing 1) a variety of real-life videos with keyframes
and corresponding unstructured and structured scene descriptions, and 2) two
new video reasoning tasks: video infilling and scene prediction. We benchmark
various vision-language models on VIP, demonstrating the potential to use
vision-language models and LLMs to enhance video chain of thought reasoning.
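As a rough illustration of the two tasks named above, the sketch below builds video infilling and scene prediction instances from an ordered list of keyframe scene descriptions. The function names, the argument layout, and the choice to mask a contiguous span are assumptions made for illustration, not the dataset's actual format.

```python
from typing import List, Tuple


def make_infilling_task(
    descriptions: List[str], start: int, n_masked: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split keyframe descriptions into (preceding context, masked targets,
    following context). Hypothetical helper: the masked span must leave at
    least one keyframe on each side."""
    end = start + n_masked
    if n_masked < 1 or start < 1 or end >= len(descriptions):
        raise ValueError("masked span needs at least one keyframe on each side")
    return descriptions[:start], descriptions[start:end], descriptions[end:]


def make_prediction_task(
    descriptions: List[str], n_context: int, n_future: int
) -> Tuple[List[str], List[str]]:
    """Split keyframe descriptions into (context, future targets).
    Hypothetical helper: the model sees only the context keyframes."""
    end = n_context + n_future
    if n_context < 1 or n_future < 1 or end > len(descriptions):
        raise ValueError("context and future spans must fit inside the video")
    return descriptions[:n_context], descriptions[n_context:end]


if __name__ == "__main__":
    keyframes = ["pours water", "adds tea bag", "stirs cup", "lifts cup", "drinks"]
    print(make_infilling_task(keyframes, start=1, n_masked=2))
    print(make_prediction_task(keyframes, n_context=3, n_future=2))
```

In this framing, a model is scored on how well its generated descriptions match the held-out targets; the scoring metric is left out because the abstract does not specify one.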

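To improve video reasoning while reducing the computational complexity of processing hundreds or thousands of frames, we propose VideoCOT, a new research direction that applies the multimodal generative abilities of vision-language models to video keyframes. We introduce the VIP dataset, which contains a variety of real-life videos with scene descriptions and two new video reasoning tasks, video infilling and scene prediction. We evaluate various vision-language models on VIP, demonstrating the potential of vision-language models and LLMs to improve video chain-of-thought reasoning.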

Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction

Reward and representation learning are two long-standing challenges for
learning an expanding set of robot manipulation skills from sensory
observations. Given the inherent cost and scarcity of in-domain, task-specific
robot data, learning from large, diverse, offline human videos has emerged as a
promising path towards acquiring a generally useful visual representation for
control; however, how these human videos can be used for general-purpose reward
learning remains an open question. We introduce
$\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a
self-supervised pre-trained visual representation capable of generating dense
and smooth reward functions for unseen robotic tasks. VIP casts representation
learning from human videos as an offline goal-conditioned reinforcement
learning problem and derives a self-supervised dual goal-conditioned
value-function objective that does not depend on actions, enabling pre-training
on unlabeled human videos. Theoretically, VIP can be understood as a novel
implicit time contrastive objective that generates a temporally smooth
embedding, enabling the value function to be implicitly defined via the
embedding distance, which can then be used to construct the reward for any
goal-image specified downstream task. Trained on large-scale Ego4D human videos
and without any fine-tuning on in-domain, task-specific data, VIP's frozen
representation can provide dense visual reward for an extensive set of
simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual
control methods and significantly outperforming all prior pre-trained
representations. Notably, VIP can enable simple, $\textbf{few-shot}$ offline RL
on a suite of real-world robot tasks with as few as 20 trajectories.
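The idea of defining the value function through embedding distance and then building a goal-image reward from it can be sketched as follows. This is a minimal sketch under stated assumptions: the abstract does not give the exact formula, so the negative L2 distance and the difference-of-values reward used here are assumptions, and the embeddings are plain sequences of floats standing in for the output of a frozen encoder that is not shown.

```python
import math
from typing import Sequence


def embedding_value(obs_emb: Sequence[float], goal_emb: Sequence[float]) -> float:
    """Value of an observation for a goal, taken here as the negative L2
    distance between their embeddings (an assumed choice of distance)."""
    return -math.dist(obs_emb, goal_emb)


def goal_reward(
    obs_emb: Sequence[float],
    next_obs_emb: Sequence[float],
    goal_emb: Sequence[float],
) -> float:
    """Dense reward for one transition: how much closer the next observation
    is to the goal in embedding space (assumed difference-of-values form)."""
    return embedding_value(next_obs_emb, goal_emb) - embedding_value(obs_emb, goal_emb)


if __name__ == "__main__":
    goal = (0.0, 0.0)
    # Moving toward the goal gives a positive reward, moving away a negative one.
    print(goal_reward((3.0, 4.0), (0.0, 1.0), goal))
    print(goal_reward((0.0, 1.0), (3.0, 4.0), goal))
```

Because the reward depends only on embeddings of the current frame, the next frame, and the goal image, it needs no action labels, which is consistent with pre-training on unlabeled video.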

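This work proposes VIP, a self-supervised representation learning method that casts learning from unlabeled human videos as self-supervised goal-conditioned reinforcement learning and generates dense, smooth reward functions, overcoming the difficulty of acquiring robot data and performing strongly in experiments.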

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

MLP-based vision backbones have recently emerged. With less inductive bias,
these architectures achieve competitive performance in image recognition
compared with CNNs and vision Transformers. Among them, spatial-shift MLP
(S$^2$-MLP), which adopts a straightforward spatial-shift operation, achieves
better performance than pioneering works including MLP-Mixer and ResMLP.
More recently, using smaller patches with a pyramid structure, Vision
Permutator (ViP) and Global Filter Network (GFNet) achieve better performance
than S$^2$-MLP.
In this paper, we improve the S$^2$-MLP vision backbone. We expand the
feature map along the channel dimension and split the expanded feature map into
several parts. We conduct different spatial-shift operations on split parts.
Meanwhile, we exploit the split-attention operation to fuse these split
parts. Moreover, like these counterparts, we adopt smaller-scale patches and
use a pyramid structure to boost image recognition accuracy. We term the
improved spatial-shift MLP vision backbone S$^2$-MLPv2. Using 55M
parameters, our medium-scale model, S$^2$-MLPv2-Medium, achieves an $83.6\%$
top-1 accuracy on the ImageNet-1K benchmark using $224\times 224$ images
without self-attention or external training data.
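The basic spatial-shift operation can be sketched as below: the channels are divided into groups, and each group is shifted by one position in a different spatial direction. This is a minimal pure-Python sketch with several assumptions not stated in the abstract: four groups, a shift of one pixel, the particular direction order, and border positions keeping their original values. The channel expansion, the multiple shift variants, and the split-attention fusion of S$^2$-MLPv2 are not shown.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]

# Assumed (d_row, d_col) offsets, one per channel group: each output position
# takes that group's channels from the neighbour at this offset.
SHIFTS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def spatial_shift(x: FeatureMap) -> FeatureMap:
    """Return a shifted copy of x; the input is left unchanged."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    if c % len(SHIFTS) != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    group = c // len(SHIFTS)
    # Start from a copy so border positions keep their original values.
    out = [[list(x[i][j]) for j in range(w)] for i in range(h)]
    for i in range(h):
        for j in range(w):
            for k in range(c):
                di, dj = SHIFTS[k // group]
                si, sj = i + di, j + dj
                if 0 <= si < h and 0 <= sj < w:
                    out[i][j][k] = x[si][sj][k]
    return out


if __name__ == "__main__":
    # 2x2 map with 4 channels; every channel at (i, j) holds 10*i + j.
    fmap = [[[float(10 * i + j)] * 4 for j in range(2)] for i in range(2)]
    print(spatial_shift(fmap)[0][0])
```

After the shift, each position holds channels drawn from its four neighbours, so a following channel-mixing MLP can combine spatial context without any attention or convolution.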

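This work improves the S$^2$-MLP vision backbone: the feature map is expanded along the channel dimension and split into several parts, different spatial-shift operations are applied to the split parts, and a split-attention operation fuses them. Smaller-scale patches and a pyramid structure are adopted to improve image recognition accuracy; the resulting backbone is called S$^2$-MLPv2. The medium-scale model, S$^2$-MLPv2-Medium, uses 55M parameters and achieves 83.6% top-1 accuracy on the ImageNet-1K benchmark with 224×224 images, without self-attention or external training data.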