Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate,
in untrimmed videos, actions unseen during training. Existing ZS-TAL methods
involve fine-tuning a model on a large amount of annotated training data. While
effective, training-based ZS-TAL approaches assume the availability of labeled
data for supervised learning, which can be impractical in some applications.
Furthermore, the training process naturally induces a domain bias into the
learned model, which may adversely affect the model's generalization ability to
arbitrary videos. These considerations prompt us to approach the ZS-TAL problem
from a radically novel perspective, relaxing the requirement for training data.
To this aim, we introduce a novel method that performs Test-Time adaptation for
Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained
Vision and Language Model (VLM). T3AL operates in three steps. First, a
video-level pseudo-label of the action category is computed by aggregating
information from the entire video. Then, action localization is performed
adopting a novel procedure inspired by self-supervised learning. Finally,
frame-level textual descriptions extracted with a state-of-the-art captioning
model are employed for refining the action region proposals. We validate the
effectiveness of T3AL by conducting experiments on the THUMOS14 and the
ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly
outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the
benefit of a test-time adaptation approach.
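
The first two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes frame and class-name embeddings have already been extracted by a VLM (here replaced by toy 2-D vectors), it uses simple mean pooling and a fixed threshold, and it omits both the test-time adaptation of the model and the caption-based refinement. The function names and the threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def video_pseudo_label(frame_feats, class_feats):
    """Step 1 (assumed form): average all frame features and pick the
    class whose text embedding is most similar to the video-level mean."""
    dim = len(frame_feats[0])
    mean = [sum(f[i] for f in frame_feats) / len(frame_feats) for i in range(dim)]
    return max(class_feats, key=lambda name: cosine(mean, class_feats[name]))

def propose_segments(frame_feats, class_vec, threshold):
    """Step 2 (simplified): score each frame against the pseudo-label and
    group consecutive frames above the threshold into (start, end) segments."""
    scores = [cosine(f, class_vec) for f in frame_feats]
    segments, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments

# Toy stand-ins for VLM embeddings (assumption: real features are high-dimensional).
class_feats = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}
frame_feats = [[1.0, 0.0], [0.1, 1.0], [0.0, 1.0], [0.2, 1.0], [1.0, 0.1]]

label = video_pseudo_label(frame_feats, class_feats)
segments = propose_segments(frame_feats, class_feats[label], threshold=0.5)
print(label, segments)
```

In this toy input the middle three frames point toward the "jump" embedding, so they form the single proposed segment.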

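The paper introduces a new method, T3AL, that performs test-time adaptation
for Temporal Action Localization (TAL). It localizes action regions with a
procedure inspired by self-supervised learning and refines the action region
proposals using frame-level textual descriptions extracted with a
state-of-the-art captioning model. Experiments show that T3AL significantly
outperforms zero-shot methods based on state-of-the-art vision-language models
on the THUMOS14 and ActivityNet-v1.3 datasets, confirming the benefit of
test-time adaptation.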

Test-Time Zero-Shot Temporal Action Localization

Existing open-vocabulary image segmentation methods require a fine-tuning
step on mask annotations and/or image-text datasets. Mask labels are
labor-intensive, which limits the number of categories in segmentation
datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is
severely reduced after fine-tuning. However, without fine-tuning, VLMs trained
under weak image-text supervision tend to make suboptimal mask predictions when
there are text queries referring to non-existing concepts in the image. To
alleviate these issues, we introduce a novel recurrent framework that
progressively filters out irrelevant texts and enhances mask quality without
training efforts. The recurrent unit is a two-stage segmenter built upon a VLM
with frozen weights. Thus, our model retains the VLM's broad vocabulary space
and strengthens its segmentation capability. Experimental results show that our
method outperforms not only the training-free counterparts, but also those
fine-tuned with millions of additional data samples, and sets new
state-of-the-art records for both zero-shot semantic and referring image
segmentation tasks. Specifically, we improve the current record by 28.8, 16.0,
and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
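
The idea of progressively filtering out irrelevant text queries can be sketched as a fixed-point loop. This is only an illustration under stated assumptions, not the paper's segmenter: `toy_score` is a hypothetical stand-in for the frozen two-stage VLM segmenter, and its only purpose is to show that scores depend on which queries are still competing, which is why a single filtering pass is not enough.

```python
def recurrent_filter(queries, score_fn, threshold=0.5, max_steps=10):
    """Repeatedly score the current queries and drop those below the
    threshold, stopping once the query set no longer changes."""
    current = list(queries)
    for _ in range(max_steps):
        scores = score_fn(current)
        kept = [q for q in current if scores[q] >= threshold]
        if kept == current:
            break
        current = kept
    return current

# Hypothetical raw image-text affinities for an image showing a cat on grass.
RAW_AFFINITY = {"cat": 0.9, "grass": 0.7, "dog": 0.5, "car": 0.1}

def toy_score(current):
    """Toy scorer (assumption): each query's affinity relative to the mean
    affinity of the queries still in play."""
    mean = sum(RAW_AFFINITY[q] for q in current) / len(current)
    return {q: RAW_AFFINITY[q] / mean for q in current}

remaining = recurrent_filter(["cat", "grass", "dog", "car"], toy_score, threshold=0.85)
print(remaining)
```

With these toy numbers, "car" is dropped in the first pass and "dog" only in the second, once the weakest competitor is gone; the loop then stabilizes on the two concepts actually present.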

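By introducing a novel recurrent framework, the study shows that, without any
training, the model outperforms methods fine-tuned with millions of additional
samples and sets a new state of the art for zero-shot semantic and referring
image segmentation.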

CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Humans apprehend the world through various sensory modalities, yet language
is their predominant communication channel. Machine learning systems need to
draw on the same multimodal richness to have informed discourses with humans in
natural language; this is particularly true for systems specialized in
visually-dense information, such as dialogue, recommendation, and search
engines for clothing. To this end, we train a visual question answering (VQA)
system to answer complex natural language questions about apparel in fashion
photoshoot images. The key to the successful training of our VQA model is the
automatic creation of a visual question-answering dataset with 168 million
samples from item attributes of 207 thousand images using diverse templates.
The sample generation employs a strategy that considers the difficulty of the
question-answer pairs to emphasize challenging concepts. Contrary to the recent
trend of using several datasets for pretraining visual question answering
models, we keep the dataset fixed while training various models from scratch,
to isolate the improvements that come from model architecture changes. We
see that using the same transformer for encoding the question and decoding the
answer, as in language models, achieves maximum accuracy, showing that visual
language models (VLMs) make the best visual question answering systems for our
dataset. The accuracy of the best model surpasses the human expert level, even
when answering human-generated questions that are not confined to the template
formats. Our approach for generating a large-scale multimodal domain-specific
dataset provides a path for training specialized models capable of
communicating in natural language. The training of such domain-expert models,
e.g., our fashion VLM, cannot rely solely on the large-scale
general-purpose datasets collected from the web.
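
The template-based dataset creation described above can be illustrated with a short sketch. Everything here is assumed for illustration: the templates, the attribute names, the item record, and the per-attribute difficulty weights are hypothetical, and the abstract does not specify how difficulty is measured or used in sampling.

```python
import random

# Hypothetical question templates, keyed by item attribute.
TEMPLATES = {
    "color": ["What color is the {category}?",
              "Which color does the {category} have?"],
    "sleeve": ["What is the sleeve length of the {category}?"],
}

def generate_qa(item):
    """Fill every matching template with the item's category and return
    (question, answer, attribute) triples."""
    pairs = []
    for attr, value in item["attributes"].items():
        for template in TEMPLATES.get(attr, []):
            pairs.append((template.format(category=item["category"]), value, attr))
    return pairs

def sample_by_difficulty(pairs, difficulty, k, seed=0):
    """Draw k pairs with replacement, weighting each by an assumed
    per-attribute difficulty so harder concepts appear more often."""
    rng = random.Random(seed)
    weights = [difficulty.get(attr, 1.0) for _, _, attr in pairs]
    return rng.choices(pairs, weights=weights, k=k)

item = {"category": "dress", "attributes": {"color": "red", "sleeve": "short"}}
pairs = generate_qa(item)
batch = sample_by_difficulty(pairs, {"sleeve": 3.0}, k=5)
for question, answer, _ in pairs:
    print(question, "->", answer)
```

A small number of templates multiplied over many attributes and images is how a few hundred thousand images can yield a far larger number of question-answer samples.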

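The paper trains a visual question answering system that uses multimodal data
to answer natural language questions about apparel in fashion photos. The
system is trained on a large-scale, domain-specific multimodal dataset
generated automatically from templates, and the accuracy of the best model
surpasses the human expert level.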