Composed video retrieval (CoVR) is a challenging problem in computer vision
that has recently drawn attention to integrating modification text with visual
queries for more sophisticated video search in large databases. Existing works
predominantly rely on visual queries combined with modification text to
distinguish relevant videos. However, such a strategy struggles to fully
preserve the rich query-specific context in retrieved target videos and only
represents the target video using a visual embedding. We introduce a novel CoVR
framework that leverages detailed language descriptions to explicitly encode
query-specific contextual information and learns discriminative vision-only,
text-only, and vision-text embeddings for better alignment, so that matched
target videos are retrieved accurately. Our proposed framework can be flexibly employed
for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on
three datasets show that our approach obtains state-of-the-art performance for
both CoVR and zero-shot CoIR tasks, achieving gains of up to around 7% in
recall@K=1. Our code, models, and detailed language descriptions for the
WebVid-CoVR dataset are available at
https://github.com/OmkarThawakar/composed-video-retrieval
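
For readers unfamiliar with the recall@K metric cited above, the following is a minimal sketch of how it is commonly computed. It is not the evaluation code from the linked repository; the function name `recall_at_k` and the convention that the correct target for query i sits at index i are assumptions made for illustration.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth target is ranked in the top k.

    Assumption (illustrative): similarity[i][j] scores query i against
    target j, and the ground-truth target for query i is target i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct target = number of targets scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores: queries 0 and 2 rank their own target first, query 1 ranks it second.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1))  # 2 of 3 queries succeed at K=1
print(recall_at_k(sim, 2))  # all 3 queries succeed at K=2
```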

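A novel CoVR framework that uses detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only, and vision-text embeddings to retrieve matched target videos more accurately.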

Composed Video Retrieval via Enriched Context and Discriminative Embeddings

Verifying a question's validity before answering is crucial in real-world
applications, where users may provide imperfect instructions. In this scenario,
an ideal model should address the discrepancies in the query and convey them to
the users rather than generating the best possible answer. Addressing this
requirement, we introduce a new compositional visual question-answering
dataset, VISREAS, that consists of answerable and unanswerable visual queries
formulated by traversing and perturbing commonalities and differences among
objects, attributes, and relations. VISREAS contains 2.07M semantically diverse
queries generated automatically using Visual Genome scene graphs. The unique
feature of this task, validating question answerability with respect to an
image before answering, and the poor performance of state-of-the-art models
inspired the design of a new modular baseline, LOGIC2VISION, which reasons by
producing and executing pseudocode without any external modules to generate the
answer. LOGIC2VISION outperforms generative models on VISREAS (+4.82% over
LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in
performance over the classification models.
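
The idea of validating answerability before answering can be illustrated with a toy example. The sketch below is not LOGIC2VISION, which produces and executes its pseudocode without external modules; it is a hand-written interpreter over a made-up scene graph, and the operation names (`select`, `filter`, `query`), the `run_program` function, and the data layout are all hypothetical.

```python
def run_program(scene, steps):
    """Execute a tiny question program; reject it if the scene contradicts it.

    Hypothetical format: each step is (op, *args) with op in
    {"select", "filter", "query"}.
    """
    candidates = list(scene["objects"])
    for op, *args in steps:
        if op == "select":
            candidates = [o for o in candidates if o["name"] == args[0]]
        elif op == "filter":
            key, value = args
            candidates = [o for o in candidates if o["attributes"].get(key) == value]
        elif op == "query":
            value = candidates[0]["attributes"].get(args[0]) if candidates else None
            if value is None:
                return ("unanswerable", f"attribute {args[0]!r} is not available")
            return ("answer", value)
        else:
            raise ValueError(f"unknown op: {op}")
        # A select/filter step that matches nothing means the question's
        # premise does not hold for this image.
        if not candidates:
            return ("unanswerable", f"{op} {args} matched nothing in the image")
    return ("unanswerable", "program has no query step")


# Made-up scene graph with two objects.
scene = {
    "objects": [
        {"name": "cup", "attributes": {"color": "red", "material": "ceramic"}},
        {"name": "table", "attributes": {"color": "brown"}},
    ]
}

# "What material is the red cup?" is answerable.
valid = [("select", "cup"), ("filter", "color", "red"), ("query", "material")]
# Perturbed version, "What material is the blue cup?", is not.
perturbed = [("select", "cup"), ("filter", "color", "blue"), ("query", "material")]

print(run_program(scene, valid))
print(run_program(scene, perturbed))
```

In this toy setting, a perturbed question is detected because one of its premises filters the candidate set down to nothing, so the program reports the discrepancy instead of guessing an answer.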

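Validating whether a question is answerable with respect to an image is crucial for question answering in real-world applications. We address this need by creating a new compositional visual question-answering dataset (VISREAS) and introducing a new baseline model (LOGIC2VISION), which reasons by generating and executing pseudocode and outperforms current generative models on VISREAS, achieving significant performance gains.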

VISREAS: Complex Visual Reasoning with Unanswerable Questions

Visual queries 3D localization (VQ3D) is a task in the Ego4D Episodic Memory
Benchmark. Given an egocentric video, the goal is to answer queries of the form
"Where did I last see object X?", where the query object X is specified as a
static image, and the answer should be a 3D displacement vector pointing to
object X. However, current techniques use naive ways to estimate the camera
poses of video frames, resulting in a low query-with-pose (QwP) ratio and thus
a poor overall success rate. We design a new pipeline for the challenging
egocentric video camera pose estimation problem. Moreover, we revisit the
current VQ3D framework and optimize it in terms of performance and efficiency.
As a result, we achieve the top-1 overall success rate of 25.8% on the VQ3D
leaderboard, roughly three times the 8.7% reported by the baseline.
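
The link between the QwP ratio and the overall success rate can be made concrete with a small sketch. It assumes, for illustration only, that each query record carries two hypothetical boolean fields, `has_pose` (a camera pose was estimated for the query) and `success` (the predicted 3D location was judged correct), and that a query without a pose cannot count as a success; the official benchmark definitions may differ in detail.

```python
def vq3d_summary(results):
    """Return (QwP ratio, overall success rate, success rate given a pose).

    Assumption (illustrative): a query without an estimated camera pose
    cannot be localized in 3D and therefore always counts as a failure.
    """
    total = len(results)
    with_pose = [r for r in results if r["has_pose"]]
    successes = sum(1 for r in with_pose if r["success"])
    qwp_ratio = len(with_pose) / total
    overall_success = successes / total
    success_given_pose = successes / len(with_pose) if with_pose else 0.0
    return qwp_ratio, overall_success, success_given_pose


# Toy data: only half of the queries have a pose, so overall success is capped at 0.5.
results = [
    {"has_pose": True, "success": True},
    {"has_pose": True, "success": False},
    {"has_pose": False, "success": False},
    {"has_pose": False, "success": False},
]
qwp, overall, given_pose = vq3d_summary(results)
print(qwp, overall, given_pose)
```

Under this assumption the overall success rate can never exceed the QwP ratio, which is why estimating poses for more frames raises the ceiling on the final score.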

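By designing a new pipeline and re-optimizing the existing VQ3D framework, we achieve the top overall success rate of 25.8% on the VQ3D leaderboard, roughly three times the baseline's 8.7%.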

Estimating more camera poses for ego-centric videos is essential for VQ3D

The complexity of the visual world creates significant challenges for
comprehensive visual understanding. In spite of recent successes in visual
recognition, today's vision systems still struggle with visual queries that
require deeper reasoning. We propose a knowledge base (KB)
framework to handle an assortment of visual queries, without the need to train
new classifiers for new tasks. Building such a large-scale multimodal KB
presents a major scalability challenge. We cast a large-scale Markov random
field (MRF) into a KB representation, incorporating visual, textual and
structured data, as well as their diverse relations. We introduce a scalable
knowledge base construction system that is capable of building a KB with half
a billion variables and millions of parameters in a few hours. Our system
achieves competitive results
compared to purpose-built models on standard recognition and retrieval tasks,
while exhibiting greater flexibility in answering richer visual queries.

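This work proposes a knowledge base framework that answers a variety of visual queries by building a large-scale multimodal knowledge base while maintaining flexibility and scalability. The results show that the proposed system achieves competitive performance and can handle richer visual queries.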