Training on large-scale datasets can boost the performance of video instance
segmentation (VIS), but annotated VIS datasets are hard to scale up because of
the high labor cost. What we possess are numerous isolated field-specific
datasets, so it is appealing to jointly train models across the aggregation
of datasets to enhance data volume and diversity. However, because the
category spaces are heterogeneous, simply utilizing multiple datasets dilutes
the attention of models across different taxonomies, even though mask
precision increases with the data volume. Thus, it is important to increase
the data scale and enrich the taxonomy space while also improving
classification precision. In this work, we observe that providing extra
taxonomy information helps models concentrate on specific taxonomies, and
propose Taxonomy-aware Multi-dataset Joint Training for Video Instance
Segmentation (TMT-VIS) to address this vital
challenge. Specifically, we design a two-stage taxonomy aggregation module that
first compiles taxonomy information from input videos and then aggregates these
taxonomy priors into instance queries before the transformer decoder. We
conduct extensive experimental evaluations on four popular and challenging
benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our
model shows significant improvement over the baseline solutions, and sets new
state-of-the-art records on all benchmarks. These appealing and encouraging
results demonstrate the effectiveness and generality of our approach. The code
is available at
https://github.com/rkzheng99/TMT-VIS
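
The two-stage taxonomy aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`compile_taxonomy`, `aggregate_into_queries`), the top-k selection by dot-product similarity, and the residual attention update are all assumptions chosen only to show the idea of first compiling taxonomy priors from a video and then injecting them into instance queries before a decoder.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compile_taxonomy(video_feature, category_embeddings, top_k):
    """Stage 1 (assumed form): keep the top_k categories whose embeddings
    are most similar to a pooled video feature."""
    ranked = sorted(
        category_embeddings.items(),
        key=lambda item: dot(video_feature, item[1]),
        reverse=True,
    )
    return dict(ranked[:top_k])


def aggregate_into_queries(queries, taxonomy_priors):
    """Stage 2 (assumed form): each instance query attends over the
    selected taxonomy embeddings and adds the attended vector to itself."""
    priors = list(taxonomy_priors.values())
    scale = math.sqrt(len(priors[0]))
    updated = []
    for q in queries:
        weights = softmax([dot(q, p) / scale for p in priors])
        attended = [
            sum(w * p[i] for w, p in zip(weights, priors))
            for i in range(len(q))
        ]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


# Hypothetical toy inputs: three categories, a 3-dimensional feature space.
category_embeddings = {
    "person": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
video_feature = [0.9, 0.8, 0.1]
priors = compile_taxonomy(video_feature, category_embeddings, top_k=2)
queries = aggregate_into_queries([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], priors)
```

In this toy setup the "car" embedding is filtered out in the first stage, so the updated queries carry no component along that category's axis; the queries that reach the decoder are biased toward the taxonomy that the video appears to contain.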

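By providing extra taxonomy information, we propose TMT-VIS, a model for multi-dataset joint training of video instance segmentation. It significantly improves over the baseline solutions on four popular and challenging benchmarks and sets new state-of-the-art records.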

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

In video object tracking, there exist rich temporal contexts among successive
frames, which have been largely overlooked in existing trackers. In this work,
we bridge the individual video frames and explore the temporal contexts across
them via a transformer architecture for robust object tracking. Unlike the
classic use of the transformer in natural language processing tasks, we
separate its encoder and decoder into two parallel branches and carefully
design them within Siamese-like tracking pipelines. The transformer encoder
promotes the target templates via attention-based feature reinforcement, which
benefits the high-quality tracking model generation. The transformer decoder
propagates the tracking cues from previous templates to the current frame,
which facilitates the object searching process. Our transformer-assisted
tracking framework is neat and trained in an end-to-end manner. With the
proposed transformer, a simple Siamese matching approach is able to outperform
the current top-performing trackers. By combining our transformer with the
recent discriminative tracking pipeline, our method sets several new
state-of-the-art records on prevalent tracking benchmarks.
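
The split into an encoder branch and a decoder branch can be illustrated with a minimal sketch. It is an assumption-laden toy, not the authors' architecture: the helper names (`attention`, `encode_templates`, `decode_search`), the single-head attention, and the residual updates are hypothetical and only show template features being reinforced by self-attention, then propagated to the current frame by cross-attention.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([
            sum((e / total) * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))
        ])
    return outputs


def add_residual(xs, ys):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def encode_templates(template_feats):
    """Encoder branch (assumed form): template features reinforce each
    other through self-attention."""
    attended = attention(template_feats, template_feats, template_feats)
    return add_residual(template_feats, attended)


def decode_search(search_feats, encoded_templates):
    """Decoder branch (assumed form): current-frame features query the
    encoded templates, so tracking cues flow from past to present."""
    attended = attention(search_feats, encoded_templates, encoded_templates)
    return add_residual(search_feats, attended)


# Hypothetical toy features: two template vectors and one search vector.
templates = [[1.0, 0.0], [0.0, 1.0]]
encoded = encode_templates(templates)
search = decode_search([[0.0, 0.0]], encoded)
```

The two functions share no state, which mirrors the idea of running the encoder on the template side and the decoder on the search side as parallel branches of a Siamese-like pipeline.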

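This paper proposes a transformer-based video object tracker. Within Siamese-like tracking pipelines, the encoder reinforces the target templates with attention-based features to improve the quality of the generated tracking model, and the decoder propagates tracking cues from previous templates to the current frame to facilitate the object search. The method sets several new state-of-the-art records on prevalent tracking benchmarks.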