Vision-language models (VLMs) are typically composed of a vision encoder,
e.g. CLIP, and a language model (LM) that interprets the encoded features to
solve downstream tasks. Despite remarkable progress, VLMs are subject to
several shortcomings due to the limited capabilities of vision encoders, such as
"blindness" to certain image features and visual hallucination. To address
these issues, we study broadening the visual encoding capabilities of VLMs. We
first comprehensively benchmark several vision encoders with different
inductive biases for solving VLM tasks. We observe that there is no single
encoding configuration that consistently achieves top performance across
different tasks, and encoders with different biases can perform surprisingly
similarly. Motivated by this, we introduce a method, named BRAVE, that
consolidates features from multiple frozen encoders into a more versatile
representation that can be directly fed as the input to a frozen LM. BRAVE
achieves state-of-the-art performance on a broad range of captioning and VQA
benchmarks and significantly reduces the aforementioned issues of VLMs, while
requiring a smaller number of trainable parameters than existing methods and
having a more compressed representation. Our results highlight the potential of
incorporating different visual biases for a broader and more contextualized
visual understanding in VLMs.
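
The abstract does not say how BRAVE consolidates the encoder features, so the following is only a minimal sketch of one plausible reading: a small, fixed set of query vectors cross-attends over the concatenated tokens of several frozen encoders, producing a compact sequence whose length is independent of the number of encoders. The cross-attention mechanism, the function names (`consolidate`, `fake_encoder`), the shared feature width, and the random stand-in features are all assumptions made for illustration; in a real system the queries would be learned and each encoder's features would first be projected to a common width.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def consolidate(encoder_outputs, queries):
    """Cross-attend a fixed set of queries over tokens from all encoders.

    encoder_outputs: list with one token list per (frozen) encoder; every
        token is a list of floats with the same width as the queries.
    queries: list of query vectors (hypothetical; learned in a real system).
    Returns one fused vector per query.
    """
    # Concatenate along the sequence axis: all tokens from all encoders.
    tokens = [tok for out in encoder_outputs for tok in out]
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product attention scores of this query against every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, tok)) for tok in tokens]
        weights = softmax(scores)
        # Attention-weighted average of the tokens.
        fused.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


rng = random.Random(0)


def fake_encoder(num_tokens, dim):
    """Stand-in for a frozen vision encoder: returns random token features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


dim = 8
# Three hypothetical encoders emitting different numbers of tokens.
encoder_outputs = [fake_encoder(16, dim), fake_encoder(49, dim), fake_encoder(25, dim)]
queries = fake_encoder(4, dim)

fused = consolidate(encoder_outputs, queries)
print(len(fused), len(fused[0]))  # 4 8
```

Here 90 input tokens from three encoders are reduced to 4 fused vectors, which is the sense in which such a representation can be "more compressed" than feeding every encoder's tokens to the LM directly.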

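Vision-language models (VLMs) typically consist of a vision encoder (e.g. CLIP) and a language model (LM) that interprets the encoded features to solve downstream tasks. We study broadening the visual encoding capabilities of VLMs to address their limitations. We first comprehensively benchmark several vision encoders with different inductive biases on VLM tasks. We observe that no single encoding configuration consistently achieves top performance across tasks, and that encoders with different biases can perform surprisingly similarly. Motivated by this, we propose a method named BRAVE that consolidates features from multiple frozen encoders into a more versatile representation that is fed directly as input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and visual question answering benchmarks and significantly reduces these limitations of VLMs, while requiring fewer trainable parameters than existing methods and having a more compact representation. Our results highlight the potential of incorporating different visual biases into VLMs for a broader and more contextualized visual understanding.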

BRAVE: Broadening the visual encoding of vision-language models

Most successful self-supervised learning methods are trained to align the
representations of two independent views from the data. State-of-the-art
methods in video are inspired by image techniques, where these two views are
similarly extracted by cropping and augmenting the resulting crop. However,
these methods miss a crucial element in the video domain: time. We introduce
BraVe, a self-supervised learning framework for video. In BraVe, one of the
views has access to a narrow temporal window of the video while the other view
has broad access to the video content. Our models learn to generalise from
the narrow view to the general content of the video. Furthermore, BraVe
processes the views with different backbones, enabling the use of alternative
augmentations or modalities in the broad view such as optical flow, randomly
convolved RGB frames, audio or their combinations. We demonstrate that BraVe
achieves state-of-the-art results in self-supervised representation learning on
standard video and audio classification benchmarks including UCF101, HMDB51,
Kinetics, ESC-50 and AudioSet.
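
The narrow/broad view idea can be illustrated with a small sampling sketch. It only shows how two temporal windows of different lengths might be drawn from one video; the function name `sample_views`, the window lengths, and the choice to sample the two windows independently are assumptions made for illustration, not details given in the abstract.

```python
import random


def sample_views(num_frames, narrow_len, broad_len, rng):
    """Pick frame index ranges for a narrow and a broad view of one video.

    Both windows are contiguous and placed uniformly at random; the two
    start positions are drawn independently (an assumption of this sketch).
    """
    if not (0 < narrow_len <= broad_len <= num_frames):
        raise ValueError("need 0 < narrow_len <= broad_len <= num_frames")
    narrow_start = rng.randrange(num_frames - narrow_len + 1)
    broad_start = rng.randrange(num_frames - broad_len + 1)
    narrow = range(narrow_start, narrow_start + narrow_len)
    broad = range(broad_start, broad_start + broad_len)
    return narrow, broad


rng = random.Random(0)
# Hypothetical numbers: a 300-frame clip, 16-frame narrow view, 128-frame broad view.
narrow, broad = sample_views(num_frames=300, narrow_len=16, broad_len=128, rng=rng)
print(len(narrow), len(broad))  # 16 128
```

In a full pipeline the narrow frames would go to one backbone and the broad frames (or another modality derived from them, such as optical flow or audio) to a different backbone, as the abstract describes.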

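BraVe performs self-supervised learning on video using views with different temporal windows, and processes the views with different backbones so that alternative augmentations and modalities can be used in the broad view. It achieves state-of-the-art results on the UCF101, HMDB51, Kinetics, ESC-50 and AudioSet video and audio classification benchmarks.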