Although semantic communication (SC) has shown its potential in efficiently
transmitting multi-modal data such as text, speech, and images, SC for video
has focused primarily on pixel-level reconstruction. Such systems may be
suboptimal for downstream intelligent tasks, whereas SC systems that skip
pixel-level video reconstruction can achieve higher bandwidth efficiency and
support real-time execution of those tasks.
The difficulty in such system design lies in the extraction of task-related
compact semantic representations and their accurate delivery over noisy
channels. In this paper, we propose an end-to-end SC system for video question
answering (VideoQA) tasks called VideoQA-SC. Our goal is to accomplish VideoQA
tasks directly based on video semantics over noisy or fading wireless channels,
bypassing the need for video reconstruction at the receiver. To this end, we
develop a spatiotemporal semantic encoder for effective video semantic
extraction, and a learning-based bandwidth-adaptive deep joint source-channel
coding (DJSCC) scheme for efficient and robust video semantic transmission.
Experiments demonstrate that VideoQA-SC outperforms traditional and advanced
DJSCC-based SC systems that rely on video reconstruction at the receiver under
a wide range of channel conditions and bandwidth constraints. In particular,
at low signal-to-noise ratios, VideoQA-SC improves answer accuracy by 5.17%
while saving almost 99.5% of the bandwidth compared with the advanced
DJSCC-based SC system. Our results show the great potential of task-oriented
SC system design for video applications.
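
The noisy-channel setting mentioned above can be illustrated with a minimal
sketch. It is not the VideoQA-SC implementation: it assumes real-valued channel
symbols, unit average transmit power, and additive white Gaussian noise, and
the function names `power_normalize` and `awgn_channel` are hypothetical.

```python
import math
import random


def power_normalize(symbols):
    """Scale symbols so that their average power equals 1."""
    power = sum(s * s for s in symbols) / len(symbols)
    if power == 0.0:
        raise ValueError("cannot normalize an all-zero symbol vector")
    scale = 1.0 / math.sqrt(power)
    return [s * scale for s in symbols]


def awgn_channel(symbols, snr_db, rng):
    """Add Gaussian noise whose power matches snr_db.

    Assumes the input already has unit average power, so the noise
    variance is 10 ** (-snr_db / 10).
    """
    noise_std = math.sqrt(10.0 ** (-snr_db / 10.0))
    return [s + rng.gauss(0.0, noise_std) for s in symbols]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for encoder output; a real system would use learned features.
    latent = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    sent = power_normalize(latent)
    received = awgn_channel(sent, snr_db=0.0, rng=rng)
    print(len(received))
```

In a DJSCC-style system the encoder and decoder would typically be trained
with a channel model like this one between them, so that the learned
representation tolerates the noise level it will meet at test time.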

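This paper proposes VideoQA-SC, an end-to-end semantic communication system
for video question answering. Through effective video semantic extraction and
efficient, robust semantic transmission, it bypasses video reconstruction at
the receiver and completes VideoQA tasks directly over noisy or fading
wireless channels. Experiments show that, under a wide range of channel
conditions and bandwidth constraints, VideoQA-SC improves answer accuracy
while saving almost 99.5% of the bandwidth.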

VideoQA-SC: Adaptive Semantic Communication for Video Question Answering

In text-video retrieval, recent works have benefited from the powerful
learning capabilities of pre-trained text-image foundation models (e.g., CLIP)
by adapting them to the video domain. A critical problem for these works is how to
effectively capture the rich semantics inside the video using the image encoder
of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal
modeling techniques to fuse the text information into video frame
representations, which, however, incurs severe efficiency issues in large-scale
retrieval systems as the video representations must be recomputed online for
every text query. In this paper, we discard this problematic cross-modal fusion
process and aim to learn semantically enhanced representations purely from the
video, so that the video representations can be computed offline and reused for
different texts. Concretely, we first introduce a spatial-temporal "Prompt
Cube" into the CLIP image encoder and iteratively switch it within the encoder
layers to efficiently incorporate the global video semantics into frame
representations. We then propose to apply an auxiliary video captioning
objective to train the frame representations, which facilitates the learning of
detailed video semantics by providing fine-grained guidance in the semantic
space. With a naive temporal fusion strategy (i.e., mean-pooling) on the
enhanced frame representations, we obtain state-of-the-art performance on
three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
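
The offline-and-reuse idea with mean-pooling fusion can be sketched as follows.
This is an illustration under stated assumptions, not the Prompt Switch code:
the frame and text features are plain lists of floats standing in for CLIP
encoder outputs, retrieval uses cosine similarity, and all function names are
hypothetical.

```python
import math


def mean_pool(frame_features):
    """Average per-frame vectors into a single video vector."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / num_frames for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def build_video_index(videos):
    """Compute one vector per video, once and independently of any text query."""
    return {vid: l2_normalize(mean_pool(frames)) for vid, frames in videos.items()}


def rank_videos(text_feature, index):
    """Rank the precomputed video vectors against a single text query."""
    query = l2_normalize(text_feature)
    scores = {
        vid: sum(q * v for q, v in zip(query, vec)) for vid, vec in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 2-D features; real features would come from the image encoder.
    videos = {
        "a": [[1.0, 0.0], [1.0, 0.2]],
        "b": [[0.0, 1.0], [0.1, 1.0]],
    }
    index = build_video_index(videos)
    print(rank_videos([1.0, 0.0], index))
```

Because `build_video_index` never sees the text, the index can be built once
and reused for every query, which is the efficiency argument made against
cross-modal fusion.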

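This paper introduces a method for learning video semantic representations in
text-video retrieval. It inserts a spatial-temporal context module (the
"Prompt Cube") into the image encoder and trains with an auxiliary video
captioning objective to strengthen the semantics of the frame representations.
Applying a simple temporal fusion strategy to the enhanced frame
representations achieves state-of-the-art performance on three benchmark
datasets (MSR-VTT, MSVD, and LSMDC).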

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

Modern AI applications involving video, such as video-text alignment, video
search, and video captioning, benefit from a fine-grained understanding of
video semantics. Existing approaches for video understanding are either
data-hungry and need low-level annotation, or are based on general embeddings
that are uninterpretable and can miss important details. We propose LASER, a
neuro-symbolic approach that learns semantic video representations by
leveraging logic specifications that can capture rich spatial and temporal
properties in video data. In particular, we formulate the problem in terms of
alignment between raw videos and specifications. The alignment process
efficiently trains low-level perception models to extract a fine-grained video
representation that conforms to the desired high-level specification. Our
pipeline can be trained end-to-end and can incorporate contrastive and semantic
loss functions derived from specifications. We evaluate our method on two
datasets with rich spatial and temporal specifications:
20BN-Something-Something and MUGEN. We demonstrate that our method not only
learns fine-grained video semantics but also outperforms existing baselines on
downstream tasks such as video retrieval.
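
The contrastive part of the alignment objective can be illustrated with a
generic InfoNCE-style loss. The abstract does not give LASER's actual loss
functions, which are derived from logic specifications, so this is only a
sketch of the general technique: it assumes a precomputed similarity matrix in
which entry (i, j) scores video i against specification j, with matching pairs
on the diagonal. The name `info_nce_loss` is hypothetical.

```python
import math


def info_nce_loss(scores):
    """Mean cross-entropy of picking the matching specification for each video.

    scores[i][j] is the similarity between video i and specification j;
    the correct pairing is assumed to be j == i.
    """
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all candidate specifications.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


if __name__ == "__main__":
    aligned = [[5.0, 0.0], [0.0, 5.0]]
    misaligned = [[0.0, 5.0], [5.0, 0.0]]
    print(info_nce_loss(aligned) < info_nce_loss(misaligned))
```

The loss is low when each video scores highest against its own specification
and rises when mismatched pairs score higher.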

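This work proposes LASER, a neuro-symbolic approach based on logic
specifications. It efficiently trains low-level perception models to extract
fine-grained video representations that conform to the desired high-level
specifications. The method not only learns fine-grained video semantics but
also outperforms existing baselines on downstream tasks.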