Video localization tasks aim to temporally locate specific instances in
videos, including temporal action localization (TAL), sound event detection
(SED) and audio-visual event localization (AVEL). Existing methods
over-specialize on each task, overlooking the fact that these instances often
occur in the same video to form the complete video content. In this work, we
present UniAV, a Unified Audio-Visual perception network, to achieve joint
learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage
diverse data available in task-specific datasets, allowing the model to learn
and share mutually beneficial knowledge across tasks and modalities. To tackle
the challenges posed by substantial variations in datasets
(size/domain/duration) and distinct task characteristics, we propose to
uniformly encode visual and audio modalities of all videos to derive generic
representations, while also designing task-specific experts to capture unique
knowledge for each task. In addition, we develop a unified language-aware
classifier by utilizing a pre-trained text encoder, enabling the model to
flexibly detect various types of instances and previously unseen ones by simply
changing prompts during inference. UniAV outperforms its single-task
counterparts by a large margin with fewer parameters, achieving on-par or
superior performance compared to state-of-the-art task-specific methods across
the ActivityNet 1.3, DESED and UnAV-100 benchmarks.
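
The language-aware classification idea can be illustrated with a minimal sketch: a segment feature is scored against text embeddings of class prompts, so the label set changes simply by changing the prompts. This is not the UniAV implementation. The function names, the toy 3-dimensional embeddings (standing in for the output of a pre-trained text encoder), the cosine-similarity scoring and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(segment_feature, prompt_embeddings, temperature=0.07):
    """Return {prompt: probability} from a softmax over cosine similarities.

    `prompt_embeddings` maps each text prompt to its embedding. In a real
    system these would come from a pre-trained text encoder; here they are
    hand-written toy vectors (an assumption for illustration).
    """
    prompts = list(prompt_embeddings)
    logits = [cosine(segment_feature, prompt_embeddings[p]) / temperature
              for p in prompts]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {p: e / total for p, e in zip(prompts, exps)}


# Hypothetical prompts and embeddings for two instance types.
prompts = {
    "a video of people playing guitar": [0.9, 0.1, 0.0],
    "the sound of a dog barking": [0.0, 1.0, 0.1],
}
feature = [0.8, 0.2, 0.0]
probs = classify(feature, prompts)
best = max(probs, key=probs.get)

# Adding a prompt at inference time extends the label set without retraining.
extended = dict(prompts)
extended["people clapping and cheering"] = [0.0, 0.0, 1.0]
best_extended = max(classify([0.0, 0.1, 0.9], extended),
                    key=classify([0.0, 0.1, 0.9], extended).get)
```

In this sketch no classifier weights are tied to a fixed set of classes, which is what would let new categories be detected by editing the prompt dictionary alone.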

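UniAV is a unified audio-visual perception network that jointly learns the temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL) tasks. It uses a pre-trained text encoder to build a unified language-aware classifier, enabling flexible detection of various types of instances. With fewer parameters than its single-task counterparts, UniAV achieves performance on par with or superior to state-of-the-art task-specific methods on the ActivityNet 1.3, DESED and UnAV-100 benchmarks.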

UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization

Localizing moments in videos with natural language typically requires
large, expensive collections of annotated video regions paired with language
queries. To eliminate the annotation cost, we make a first attempt to train a
natural language video localization (NLVL) model in a zero-shot manner.
Inspired by the unsupervised image captioning setup, we require only random
text corpora, unlabeled video collections, and an off-the-shelf object
detector to train a model. With the
unpaired data, we propose to generate pseudo-supervision of candidate temporal
regions and corresponding query sentences, and develop a simple NLVL model to
train with the pseudo-supervision. Our empirical validations show that the
proposed pseudo-supervised method outperforms several baseline approaches and a
number of methods using stronger supervision on Charades-STA and
ActivityNet-Captions.

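This paper proposes a novel pseudo-supervised method for training a natural language video localization model in a zero-shot manner. Experiments on the Charades-STA and ActivityNet-Captions datasets show that it outperforms several baselines and a number of methods that use stronger supervision.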

Zero-shot Natural Language Video Localization

Due to the large memory footprint of untrimmed videos, current
state-of-the-art video localization methods operate atop precomputed video clip
features. These features are extracted from video encoders typically trained
for trimmed action classification tasks, making such features not necessarily
suitable for temporal localization. In this work, we propose a novel supervised
pretraining paradigm for clip features that not only trains to classify
activities but also considers background clips and global video information to
improve temporal sensitivity. Extensive experiments show that using features
trained with our novel pretraining strategy significantly improves the
performance of recent state-of-the-art methods on three tasks: Temporal Action
Localization, Action Proposal Generation, and Dense Video Captioning. We also
show that our pretraining approach is effective across three encoder
architectures and two pretraining datasets. We believe video feature encoding
is an important building block for localization algorithms, and extracting
temporally-sensitive features should be of paramount importance in building
more accurate models. The code and pretrained models are available on our
project website.
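
As a rough illustration of a pretraining objective that classifies activities while also considering background clips and global video information, the sketch below combines two cross-entropy terms per clip. It is not the authors' implementation. The helper names, the use of element-wise max pooling for the global video feature, the concatenation of local and global features, and the unweighted sum of the two losses are all assumptions made for this example.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` for the class index `target`."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def global_feature(clip_features):
    """Global video feature: element-wise max over all clip features (assumed)."""
    return [max(column) for column in zip(*clip_features)]


def linear(weights, x):
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def clip_loss(local, global_, action_w, region_w, action_label, is_foreground):
    """Loss for one clip.

    The region head sees the local feature concatenated with the global one
    and predicts background (0) vs. foreground (1). The action head sees only
    the local feature and is trained on foreground clips alone, since
    background clips carry no action label.
    """
    region_logits = linear(region_w, local + global_)
    loss = softmax_cross_entropy(region_logits, 1 if is_foreground else 0)
    if is_foreground:
        loss += softmax_cross_entropy(linear(action_w, local), action_label)
    return loss


# Toy video with two 2-dimensional clip features and 3 action classes.
clips = [[1.0, 0.0], [0.0, 2.0]]
gvf = global_feature(clips)
action_w = [[0.0, 0.0] for _ in range(3)]  # 3 classes x 2 dims, zero-initialized
region_w = [[0.0] * 4 for _ in range(2)]   # 2 classes x (2 local + 2 global) dims
bg_loss = clip_loss(clips[0], gvf, action_w, region_w, 0, is_foreground=False)
fg_loss = clip_loss(clips[1], gvf, action_w, region_w, 2, is_foreground=True)
```

With zero-initialized weights both heads output uniform distributions, so the background clip contributes only the region term while the foreground clip contributes both terms.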

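This work proposes a new supervised pretraining paradigm that, by considering background clips and global video information, trains clip features not only to classify activities but also to be temporally sensitive. It significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning.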