Global visual localization in LiDAR maps, crucial for autonomous driving
applications, remains largely unexplored because of the difficulty of bridging
the cross-modal heterogeneity gap. Contrastive Language-Image Pre-Training
(CLIP), a popular multi-modal learning approach, popularized a symmetric
contrastive loss computed over batches of paired text and image samples. We
apply this approach to the domains of 2D images and 3D LiDAR points for the
task of cross-modal localization. Our method works as follows. A batch of N
(image, LiDAR) pairs is constructed, and an image encoder and a LiDAR encoder
are trained jointly to learn a multi-modal embedding space in which the correct
matches among the N × N possible pairings across the batch can be predicted.
The cosine similarity of the N positive pairings is maximized, whereas that of
the remaining N² − N negative pairings is minimized. Finally, a symmetric
cross-entropy loss is optimized over the resulting similarity scores. To the
best of our knowledge, this is the first work to apply a batched loss to the
cross-modal setting of image and LiDAR data, and the first to show zero-shot
transfer in a visual localization setting. We conduct extensive analyses on
standard autonomous driving datasets such as KITTI and KITTI-360. Our method
outperforms the state-of-the-art recall@1 accuracy on the KITTI-360 dataset by
22.4% using only perspective images, whereas the state-of-the-art approach
uses the more informative fisheye images. This performance is achieved without
resorting to complex architectures. Moreover, we demonstrate the zero-shot
capabilities of our model, which beats the state of the art by 8% without
being trained on the target dataset. Furthermore, we establish the first
benchmark for cross-modal localization on the KITTI dataset.
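
The batched loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain-list embedding format, and the temperature value of 0.07 are assumptions made for illustration, and the encoders that would produce the embeddings are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_emb, lidar_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    image_emb[i] and lidar_emb[i] are assumed to come from the same place,
    so the N diagonal entries are positives and the rest are negatives.
    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    n = len(image_emb)
    img = [l2_normalize(v) for v in image_emb]
    pcd = [l2_normalize(v) for v in lidar_emb]

    # logits[i][j] = scaled cosine similarity of image i and LiDAR scan j
    logits = [
        [sum(a * b for a, b in zip(img[i], pcd[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-LiDAR direction: each row should pick its own column.
    loss_i2l = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # LiDAR-to-image direction: each column should pick its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_l2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2l + loss_l2i)
```

With correctly paired embeddings the loss approaches zero, while swapping the LiDAR embeddings within the batch makes it large, which is the behavior the training objective relies on.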

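Applying the Contrastive Language-Image Pre-Training (CLIP) approach to cross-modal localization between images and LiDAR point clouds, this work is the first to use a batched loss in this setting and to demonstrate zero-shot transfer. It improves recall@1 accuracy over the state of the art by 22.4% on the KITTI-360 dataset without requiring a complex network architecture.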

LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization

In this report, we present our champion solution to the WSDM2023 Toloka
Visual Question Answering (VQA) Challenge. Unlike the common VQA and visual
grounding (VG) tasks, this challenge involves a more complex scenario, i.e.,
inferring and locating the object implicitly specified by the given
interrogative question. For this task, we leverage ViT-Adapter, a
pre-training-free adapter network, to adapt the multi-modal pre-trained
Uni-Perceiver for better cross-modal localization. Our method ranks first on
the leaderboard, achieving 77.5 and 76.347 IoU on the public and private test
sets, respectively. This shows that ViT-Adapter is also an effective paradigm
for adapting the unified perception model to vision-language downstream tasks.
Code and models will be released.
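
For readers unfamiliar with the IoU metric quoted above, the following is a minimal sketch of intersection over union for two boxes. It assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; the function name and box format are illustrative assumptions, and the challenge's exact evaluation protocol is not described here.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Degenerate boxes with zero total area are treated as no overlap.
    return intersection / union if union > 0 else 0.0
```

A predicted box that coincides with the ground-truth box scores 1.0, and a box with no overlap scores 0.0.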

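This report presents our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge. Using ViT-Adapter and Uni-Perceiver for cross-modal localization, the method infers and locates the object implicitly specified by a given question. It ranks first on the leaderboard, achieving 77.5 and 76.347 IoU on the public and private test sets, respectively.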

Champion Solution for the WSDM2023 Toloka VQA Challenge

Automatically localizing a position based on a few natural language
instructions is essential for future robots to communicate and collaborate with
humans. To approach this goal, we focus on the text-to-point-cloud cross-modal
localization problem. Given a textual query, the goal is to identify the
described location in city-scale point clouds. The task involves two
challenges. 1) In city-scale point clouds, similar ambient instances may exist
in several locations. Searching each location in a huge point cloud with only
instances as guidance may yield weakly discriminative signals and incorrect
results. 2) In textual descriptions, the hints are provided separately, so the
relations among them are not explicitly described, which makes those relations
difficult to learn. To overcome these two challenges, we
propose a unified Relation-Enhanced Transformer (RET) to improve representation
discriminability for both point cloud and natural language queries. The core of
the proposed RET is a novel Relation-enhanced Self-Attention (RSA) mechanism,
which explicitly encodes instance (hint)-wise relations for the two modalities.
Moreover, we propose a fine-grained cross-modal matching method to further
refine the location predictions in a subsequent instance-hint matching stage.
Experimental results on the KITTI360Pose dataset demonstrate that our approach
surpasses the previous state-of-the-art method by a large margin.

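This paper proposes a unified Relation-Enhanced Transformer (RET) that combines a novel relation-enhanced self-attention mechanism with a fine-grained cross-modal matching method to address text-to-point-cloud cross-modal localization, and it outperforms the previous state-of-the-art method on the KITTI360Pose dataset.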

Text to Point Cloud Localization with Relation-Enhanced Transformer

Natural language-based communication with mobile devices and home appliances
is becoming increasingly popular and has the potential to become natural for
communicating with mobile robots in the future. Towards this goal, we
investigate cross-modal text-to-point-cloud localization that will allow us to
specify, for example, a vehicle pick-up or goods delivery location. In
particular, we propose Text2Pos, a cross-modal localization module that learns
to align textual descriptions with localization cues in a coarse-to-fine
manner. Given a point cloud of the environment, Text2Pos locates a position
that is specified via a natural language-based description of the immediate
surroundings. To train Text2Pos and study its performance, we construct
KITTI360Pose, the first dataset for this task, based on the recently introduced
KITTI360 dataset. Our experiments show that we can localize 65% of textual
queries within 15 m of the query location when considering the top-10
retrieved locations.
This is a starting point that we hope will spark future developments towards
language-based navigation.
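
The reported metric (the fraction of queries localized within a distance threshold among the top-k retrieved locations) can be sketched as follows. This is an illustrative reading of the metric rather than the authors' evaluation code; the function name, the coordinate-tuple format, and the assumption that candidates are already sorted by retrieval score are all hypothetical.

```python
import math


def localization_recall(query_positions, retrieved_positions, k=10, threshold_m=15.0):
    """Fraction of queries with at least one top-k candidate within threshold_m.

    query_positions[i] is the ground-truth position of query i, and
    retrieved_positions[i] is its candidate list, assumed sorted best-first.
    Positions are coordinate tuples in meters.
    """
    hits = 0
    for query, candidates in zip(query_positions, retrieved_positions):
        # A query counts as localized if any of its top-k candidates is close enough.
        if any(math.dist(query, cand) <= threshold_m for cand in candidates[:k]):
            hits += 1
    return hits / len(query_positions)
```

Under this reading, the result quoted in the abstract corresponds to a value of about 0.65 for `k=10` and `threshold_m=15.0`.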

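This paper proposes Text2Pos, a module that localizes a position from a textual description, laying the groundwork for future natural-language-based navigation.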