In contrast to conventional visual question answering, video-grounded dialog
necessitates a profound understanding of both dialog history and video content
for accurate response generation. Although existing methods have made
considerable progress, they often struggle to incrementally understand
intricate dialog histories and to assimilate video information. In
response to this gap, we present an iterative tracking and reasoning strategy
that amalgamates a textual encoder, a visual encoder, and a generator. At its
core, our textual encoder is fortified with a path tracking and aggregation
mechanism, adept at gleaning nuances from dialog history that are pivotal to
deciphering the posed questions. Concurrently, our visual encoder harnesses an
iterative reasoning network, meticulously crafted to distill and emphasize
critical visual markers from videos, enhancing the depth of visual
comprehension. To bring this enriched information together, we employ the
pre-trained GPT-2 model as our response generator, which produces coherent and
contextually apt answers. Experiments on two well-known datasets support the
effectiveness and adaptability of the proposed design.

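In contrast to conventional visual question answering, video-grounded dialog requires a deep understanding of both dialog history and video content to generate accurate responses. To address the challenges that existing methods face in incrementally understanding complex dialog histories and incorporating video information, we propose an iterative tracking and reasoning strategy that combines a textual encoder, a visual encoder, and a generator. At its core, our textual encoder has a path tracking and aggregation mechanism that extracts from the dialog history the nuances essential to interpreting the question. Meanwhile, our visual encoder uses an iterative reasoning network designed to extract and emphasize key visual markers from videos, deepening visual understanding. A pre-trained GPT-2 model serves as the response generator, integrating this enriched information to produce coherent, contextually relevant answers. Empirical evaluation on two well-known datasets confirms the strength and adaptability of the proposed design.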

Revealing Hidden Connections: Iterative Tracking and Reasoning for Video-Related Dialog

Uncovering Hidden Connections: Iterative Tracking and Reasoning for Video-grounded Dialog

Outside-knowledge visual question answering is a challenging task that
requires both the acquisition and the use of open-ended real-world knowledge.
Some existing solutions draw external knowledge into the cross-modality space,
which overlooks the much vaster textual knowledge in natural-language space,
while others transform the image into text that is then fused with textual
knowledge in natural-language space, completely abandoning the use of visual
features. In this paper, we constrain the cross-modality space to coincide
with the natural-language space, so that visual features are preserved
directly and the model still benefits from the vast knowledge in
natural-language space. To this end, we propose a novel
framework consisting of a multimodal encoder, a textual encoder and an answer
decoder. This structure allows us to introduce more types of knowledge,
including explicit and implicit multimodal and textual knowledge. Extensive
experiments validate the proposed method, which outperforms
the state of the art by 6.17% in accuracy. We also conduct comprehensive ablations
of each component, and systematically study the roles of varying types of
knowledge. Code and knowledge data can be found at
this https URL

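This paper proposes a novel framework comprising a multimodal encoder, a textual encoder, and an answer decoder. It constrains the cross-modality space to the natural-language space, so that visual features are preserved directly while the model draws on more types of knowledge from the natural-language space. Experiments show that it performs well in most cases.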

Combining Thinking and Observing for Outside-Knowledge Visual Question Answering

Combo of Thinking and Observing for Outside-Knowledge VQA

Few-shot learning aims to train and optimize a model that can adapt to unseen
visual classes with only a few labeled examples. Existing few-shot learning
(FSL) methods rely heavily on visual data alone and thus fail to capture the
semantic attributes needed to learn a more generalized version of a visual
concept from very few examples. However, it is well known that human visual learning
benefits immensely from inputs from multiple modalities such as vision,
language, and audio. Inspired by the way humans encapsulate existing
knowledge of a visual category in the form of language, we
introduce a contrastive alignment mechanism for visual and semantic feature
vectors to learn more generalized visual concepts for few-shot learning.
Our method simply adds an auxiliary contrastive learning objective which
captures the contextual knowledge of a visual category from a strong textual
encoder in addition to the existing training mechanism. Hence, the approach is
more generalized and can be plugged into any existing FSL method. The
pre-trained semantic feature extractor (learned from large-scale text
corpora) that we use provides strong contextual prior knowledge to
assist FSL. Experimental results on popular FSL datasets show that our
approach is generic in nature and provides a strong boost to the existing FSL
baselines.
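
As a rough illustration of what an auxiliary contrastive alignment objective of
this kind might look like, the sketch below computes an InfoNCE-style loss that
pulls each visual feature toward the text embedding of its own class and pushes
it away from the other classes. The abstract does not specify the loss, so the
InfoNCE form, the temperature, the weighting factor, and all function names
here are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_alignment_loss(visual_feats, class_text_embeds, labels,
                               temperature=0.1):
    """Assumed InfoNCE-style alignment loss (hypothetical helper).

    visual_feats:      one feature vector per image
    class_text_embeds: one text embedding per class
    labels:            class index of each image
    """
    text = [l2_normalize(t) for t in class_text_embeds]
    total = 0.0
    for feat, label in zip(visual_feats, labels):
        v = l2_normalize(feat)
        # Cosine similarity to every class embedding, scaled by temperature.
        logits = [dot(v, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all classes.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the image's own class as the positive.
        total += log_denom - logits[label]
    return total / len(labels)


def total_loss(base_loss, align_loss, weight=0.5):
    """Add the auxiliary term to an existing FSL loss (weight is assumed)."""
    return base_loss + weight * align_loss


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]          # toy class text embeddings
    labels = [0, 1]
    aligned = [[0.9, 0.1], [0.2, 0.8]]       # features close to own class
    misaligned = [[0.1, 0.9], [0.8, 0.2]]    # features close to wrong class
    print(contrastive_alignment_loss(aligned, text, labels))
    print(contrastive_alignment_loss(misaligned, text, labels))
```

In this toy setup the aligned features give a smaller loss than the misaligned
ones. Because the term is simply added to an existing loss, it could in
principle be attached to any FSL training loop, which matches the plug-in
claim above.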

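This paper introduces a contrastive alignment mechanism for learning more generalized visual concepts from very few examples. Experimental results show that the method is generic and gives a strong boost to existing baselines.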