Reducing the representational discrepancy between source and target domains
is key to maximizing model generalization. In this work, we
advocate for leveraging natural language supervision for the domain
generalization task. We introduce two modules to ground visual representations
with texts that contain typical human reasoning: (1) a Visual and Textual Joint
Embedder and (2) a Textual Explanation Generator. The former learns an
image-text joint embedding space in which we can ground high-level
class-discriminative information into the model. The latter leverages an
explainable model and generates explanations justifying the rationale behind
its decision. To the best of our knowledge, this is the first work to leverage
the vision-and-language cross-modality approach for the domain generalization
task. Our experiments with a newly created CUB-DG benchmark dataset demonstrate
that cross-modality supervision can be successfully used to ground
domain-invariant visual representations and improve model generalization.
Furthermore, on the large-scale DomainBed benchmark, our proposed method
achieves state-of-the-art results and ranks first in average performance across
five multi-domain datasets. The dataset and code are available at
this https URL.
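
The joint embedding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear projections, the cosine-based `alignment_loss`, and all toy values are assumptions chosen only to show how image and text features might be mapped into one shared space and pulled together.

```python
import math

def linear(x, weights):
    """Project vector x with a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def l2_normalize(x):
    """Scale x to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(p * q for p, q in zip(a, b))

def alignment_loss(image_feat, text_feat, w_img, w_txt):
    """Hypothetical loss: 1 - cosine similarity in the shared space."""
    z_img = linear(image_feat, w_img)   # image feature -> joint space
    z_txt = linear(text_feat, w_txt)    # text feature  -> joint space
    return 1.0 - cosine_similarity(z_img, z_txt)

# Toy features and projections (all values are illustrative assumptions).
image_feat = [1.0, 0.0, 2.0]                      # 3-d visual feature
w_img = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]        # maps 3-d -> 2-d
w_txt = [[2.0, 0.0], [0.0, 2.0]]                  # maps 2-d -> 2-d

matching_text = [0.5, 1.0]      # projects to the same direction as the image
mismatched_text = [1.0, -0.5]   # projects to an orthogonal direction

loss_match = alignment_loss(image_feat, matching_text, w_img, w_txt)
loss_mismatch = alignment_loss(image_feat, mismatched_text, w_img, w_txt)
print(loss_match, loss_mismatch)
```

In this toy setup the matching pair yields a loss near zero and the orthogonal pair a loss near one, so minimizing such a term would push paired image and text embeddings toward the same direction.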

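This paper proposes a cross-modal domain generalization method based on natural language supervision. It uses joint visual-textual representations to ground high-level class-discriminative information and an explainable model to generate explanations, improving the model's generalization and performance. The method achieves state-of-the-art results on multiple datasets.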

Grounding Visual Representations with Texts for Domain Generalization

The visual dialog task attempts to train an agent to answer multi-turn
questions about an image, which requires a deep understanding of the
interactions between the image and the dialog history. Existing work tends to
employ modality-specific modules to model these interactions, which can be
cumbersome to use. To fill this gap, we propose VU-BERT, a unified framework
for image-text joint embedding, and are the first to apply patch projection to
obtain vision embeddings in visual dialog tasks, which simplifies the model.
The model is trained on two tasks: masked language modeling and next utterance
retrieval. These tasks help it learn visual concepts, utterance dependencies,
and the relationships between the two modalities. Finally, VU-BERT achieves
competitive performance (an NDCG score of 0.7287) on the VisDial v1.0 dataset.
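
The patch projection step mentioned above can be sketched as follows. This is a minimal illustration, not the VU-BERT code: it assumes a single-channel image stored as a list of rows, non-overlapping square patches, and a shared linear layer; the function names, weights, and image values are all hypothetical.

```python
def extract_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into one vector.
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches

def project_patches(patches, weights, bias):
    """Apply the same linear projection (weights, bias) to every flattened patch."""
    return [[sum(w * p for w, p in zip(row, patch)) + b
             for row, b in zip(weights, bias)]
            for patch in patches]

# Toy 4 x 4 grayscale image (illustrative values only).
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]

patches = extract_patches(image, 2)   # four patches of 4 pixels each

# Hypothetical projection from 4 pixel values to a 2-d embedding.
weights = [[0.25, 0.25, 0.25, 0.25],  # output 0: mean intensity of the patch
           [1.0, 0.0, 0.0, -1.0]]     # output 1: top-left minus bottom-right
bias = [0.0, 0.0]

embeddings = project_patches(patches, weights, bias)
print(embeddings)
```

Each patch becomes one embedding vector, so the image turns into a short sequence that could be fed to a transformer alongside text tokens without a separate visual feature extractor, which is the kind of simplification the abstract refers to.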

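This paper proposes VU-BERT, a unified image-text joint embedding framework that simplifies the model by obtaining visual embeddings through patch projection. This addresses the difficulty of using the modality-specific modules that existing work relies on to model interactions, and the model achieves competitive performance on the visual dialog task.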