In this paper, we introduce a novel visual representation learning approach that
relies on a handful of adaptively learned tokens and is applicable to
both image and video understanding tasks. Instead of relying on hand-designed
splitting strategies to obtain visual tokens and processing a large number of
densely sampled patches for attention, our approach learns to mine important
tokens in visual data. This finds a few important visual tokens efficiently
and effectively, and enables modeling of pairwise attention between such
tokens over a longer temporal horizon for videos or over the spatial content
of images. Our experiments demonstrate strong performance on several
challenging benchmarks for both image and video recognition tasks. Importantly,
because our tokens are adaptive, we achieve competitive results at a
significantly reduced computational cost. We obtain results comparable to the
state of the art on ImageNet while being computationally more efficient. We
also confirm the effectiveness of the approach on multiple video datasets,
including Kinetics-400, Kinetics-600, Charades, and AViD.
The code is available at:
this https URL
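
To make the efficiency argument concrete, the sketch below pools many patch features into a handful of tokens and compares the number of pairwise attention interactions before and after. It is an illustration under stated assumptions, not the paper's architecture: the abstract does not say how tokens are computed, so the softmax-weighted pooling, the random stand-in weights, the 196-patch grid (14x14), the feature size, and the names `learn_tokens` and `pairwise_interactions` are all hypothetical. Only the token count of 8 comes from the paper's title.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def learn_tokens(patches, scorers):
    """Pool N patch features into len(scorers) tokens.

    patches: list of N feature vectors, each of length D.
    scorers: list of S weight vectors of length D. These are hypothetical
             stand-ins for learned parameters, one per output token.
    """
    dim = len(patches[0])
    tokens = []
    for w in scorers:
        # One scalar score per patch, normalised into a weight map.
        scores = [sum(wi * xi for wi, xi in zip(w, p)) for p in patches]
        weights = softmax(scores)
        # The weighted sum over patches yields one D-dimensional token.
        token = [sum(a * p[d] for a, p in zip(weights, patches))
                 for d in range(dim)]
        tokens.append(token)
    return tokens


def pairwise_interactions(n_tokens):
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens


rng = random.Random(0)
N, D, S = 196, 16, 8  # assumed 14x14 patch grid, assumed feature size, 8 tokens
patches = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
scorers = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(S)]

tokens = learn_tokens(patches, scorers)
dense_pairs = pairwise_interactions(N)   # attention over all patches
sparse_pairs = pairwise_interactions(S)  # attention over the pooled tokens
```

Under these assumptions, attention over 8 tokens involves 64 query-key pairs instead of 38,416 over 196 patches, which is the kind of saving the abstract attributes to using few adaptive tokens.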

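This paper introduces a new visual representation learning method that relies on a small number of adaptively learned tokens and applies to both image and video understanding tasks. Unlike methods that rely on hand-designed splitting strategies and process a large number of densely sampled patches for attention, our method learns to mine important tokens from visual data. It thereby finds a few important visual tokens effectively and enables pairwise attention modeling over a longer temporal range in videos or over the spatial content of images, with higher computational efficiency. The method performs strongly on several challenging benchmarks and obtains results comparable to the state of the art on ImageNet at significantly reduced compute. We also verify its effectiveness on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD.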

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

Weakly supervised referring expression grounding (REG) aims at localizing the
referential entity in an image according to a linguistic query, where the mapping
between the image region (proposal) and the query is unknown in the training
stage. In referring expressions, people usually describe a target entity in
terms of its relationship with other contextual entities as well as visual
attributes. However, previous weakly supervised REG methods rarely pay
attention to the relationship between the entities. In this paper, we propose a
knowledge-guided pairwise reconstruction network (KPRN), which models the
relationship between the target entity (subject) and the contextual entity
(object) and grounds both entities. Specifically, we first design a
knowledge extraction module to guide the proposal selection of subject and
object. The prior knowledge takes the form of semantic
similarities between each proposal and the subject/object. Second, guided by
such knowledge, we design subject and object attention modules to construct
the subject-object proposal pairs. The subject attention excludes the unrelated
proposals from the candidate proposals. The object attention selects the most
suitable proposal as the contextual proposal. Third, we introduce a pairwise
attention and an adaptive weighting scheme to learn the correspondence between
these proposal pairs and the query. Finally, a pairwise reconstruction module
is used to measure the grounding for weakly supervised learning. Extensive
experiments on four large-scale datasets show that our method outperforms existing
state-of-the-art methods by a large margin.
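
The first two steps described above (similarity-based knowledge, then subject and object attention to form proposal pairs) can be pictured with the rough sketch below. It assumes each proposal and each of the subject/object words already has an embedding vector, uses cosine similarity as the semantic similarity, and replaces the learned attention modules with a hard threshold and an argmax. The embeddings, the threshold value, and the names `cosine` and `build_pairs` are hypothetical; the abstract does not specify any of them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_pairs(proposal_vecs, subject_vec, object_vec, keep_threshold=0.5):
    """Return (subject_index, object_index) proposal pairs.

    keep_threshold is an assumed cut-off, not a value from the paper.
    """
    # Knowledge: similarity of every proposal to the subject and the object.
    subj_sim = [cosine(p, subject_vec) for p in proposal_vecs]
    obj_sim = [cosine(p, object_vec) for p in proposal_vecs]

    # Subject attention (simplified): exclude proposals unrelated to the subject.
    candidates = [i for i, s in enumerate(subj_sim) if s >= keep_threshold]

    # Object attention (simplified): pick the single best contextual proposal.
    context = max(range(len(proposal_vecs)), key=lambda i: obj_sim[i])

    # Pair every remaining subject candidate with the contextual proposal.
    return [(i, context) for i in candidates if i != context]


# Toy 2-D embeddings: two subject-like proposals, one object-like, one unrelated.
proposals = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
subject = [1.0, 0.0]
obj = [0.0, 1.0]

pairs = build_pairs(proposals, subject, obj)
```

In this toy setup the two subject-like proposals are each paired with the object-like proposal. In the method as described, a further pairwise attention and adaptive weighting step would then score such pairs against the query, which this sketch leaves out.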

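This paper proposes a knowledge-guided pairwise reconstruction network (KPRN) framework to address weakly supervised referring expression grounding (REG), and reports experiments on four large-scale datasets that demonstrate the model's strong performance.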