Region-level multi-modality methods can translate referred image regions to
human-preferred language descriptions. Unfortunately, most existing methods
use fixed visual inputs and lack the resolution adaptability needed to
produce precise language descriptions. In this study, we propose a dynamic
resolution approach, referred to as DynRefer, to pursue high-accuracy
region-level referring by mimicking the resolution adaptability of human
visual cognition. DynRefer first implements stochastic vision-language
alignment. It aligns desired language descriptions of multi-modality tasks with
images of stochastic resolution, which are constructed by nesting a set of
views around the referred region. DynRefer then implements dynamic
multi-modality referring, which is realized by selecting views based on image
and language priors. This allows the visual information used for referring to
better match human preferences, thereby improving the representational
adaptability of region-level multi-modality models. Extensive experiments show
that DynRefer brings mutual improvement across tasks including region-level
captioning, open-vocabulary region recognition, and attribute detection. Last
but not least, DynRefer achieves a new state of the art on multiple region-level
multi-modality tasks using a single model. Code is available at
this https URL
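
The nested views described above can be illustrated with a small sketch. The abstract does not say how the views are built, so this assumes each view is a box linearly interpolated between the referred region (t = 0) and the full image (t = 1), and that "stochastic resolution" means sampling the interpolation factors at random. The names `nested_views`, `sample_stochastic_views`, and `n_views` are hypothetical and not taken from the DynRefer code.

```python
import random


def nested_views(region, image_size, ts):
    """Return one box per t, interpolated from the region (t=0) to the full image (t=1).

    Assumption: linear interpolation of box corners; boxes are (x1, y1, x2, y2).
    """
    width, height = image_size
    full = (0.0, 0.0, float(width), float(height))
    return [tuple(r + t * (f - r) for r, f in zip(region, full)) for t in ts]


def sample_stochastic_views(region, image_size, n_views=3, rng=None):
    """Sample a random set of nested views for one training step (hypothetical scheme)."""
    rng = rng or random.Random()
    # Always keep the tightest view (t=0); draw the remaining factors at random.
    ts = [0.0] + sorted(rng.random() for _ in range(n_views - 1))
    return ts, nested_views(region, image_size, ts)


if __name__ == "__main__":
    ts, views = sample_stochastic_views((40, 30, 80, 60), (200, 100), rng=random.Random(0))
    for t, box in zip(ts, views):
        print(f"t={t:.2f} box={box}")
```

Because the factors are sorted, each view contains the previous one, which is what makes the set nested. Selecting views from image and language priors at inference time is not sketched here, since the abstract gives no detail on it.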

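DynRefer, a dynamic resolution approach, improves high-accuracy referring in region-level multi-modality tasks, increases the representational adaptability of multi-modality models, and achieves new state-of-the-art results on multiple region-level multi-modality tasks.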

DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution

Current architectures for multi-modality tasks such as visual question
answering suffer from high complexity. As a result, these architectures
are difficult to train and require substantial computational resources. To address
these problems, we present a CLIP-based architecture that does not require any
fine-tuning of the feature extractors. A simple linear classifier is used on
the concatenated features of the image and text encoders. During training, an
auxiliary loss that operates on the answer types is added. The resulting
answer-type classification is then used as an attention gate on the answer class selection.
On the VizWiz 2022 Visual Question Answering Challenge, we achieve 60.15%
accuracy on Task 1: Predict Answer to a Visual Question and an AP score of 83.78%
on Task 2: Predict Answerability of a Visual Question.
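
The classifier described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the CLIP features are stood in for by fixed lists of floats, the gate is assumed to be a sigmoid of a linear map over the answer-type probabilities multiplied elementwise into the answer logits, and `aux_weight` along with all function names are hypothetical.

```python
import math
import random


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(n_out, n_in, rng):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def forward(image_feat, text_feat, params):
    # Frozen encoder outputs are concatenated; only the linear heads would be trained.
    x = list(image_feat) + list(text_feat)
    answer_logits = linear(x, *params["answer"])
    type_logits = linear(x, *params["type"])
    # Assumed gating: answer-type probabilities -> per-answer gate in (0, 1).
    gate = [sigmoid(g) for g in linear(softmax(type_logits), *params["gate"])]
    gated_logits = [a * g for a, g in zip(answer_logits, gate)]
    return gated_logits, type_logits


def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])


def total_loss(gated_logits, type_logits, answer_idx, type_idx, aux_weight=1.0):
    # Main answer loss plus the auxiliary answer-type loss.
    return cross_entropy(gated_logits, answer_idx) + aux_weight * cross_entropy(type_logits, type_idx)
```

Keeping the encoders frozen means only the three small linear layers carry trainable parameters, which is the source of the low training cost the abstract claims.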

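This study proposes a CLIP-based architecture that applies a simple linear classifier to the concatenated features of the image and text encoders and adds an auxiliary loss on the answer types during training, using the resulting prediction as an attention gate on the answer class selection. It addresses the high complexity, training difficulty, and heavy computational requirements of multi-modality architectures, and achieves 60.15% accuracy and an average precision score of 83.78% on the VizWiz 2022 Visual Question Answering Challenge.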