Video scene parsing in the wild, across diverse scenarios, is a challenging
and highly significant task, especially given the rapid development of
autonomous driving technology. The Video Scene Parsing in the Wild (VSPW)
dataset contains well-trimmed, long-temporal, densely annotated,
high-resolution clips. Based
on VSPW, we design a Temporal Bilateral Network with Vision Transformer. We
first design a spatial path with convolutions to generate low-level features
that preserve spatial information. Meanwhile, a context path with a
Vision Transformer is employed to obtain sufficient context information.
Furthermore, a temporal context module is designed to harness inter-frame
contextual information. Finally, the proposed method achieves a mean
intersection over union (mIoU) of 49.85\% on the VSPW2021 Challenge test
dataset.
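
For readers unfamiliar with the reported metric, the following is a minimal sketch of how mIoU is commonly computed from predicted and ground-truth label maps. It is not the challenge's official evaluation code: the function name `mean_iou`, the flattened-list input format, and the `ignore_index=255` convention are all assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection over union across classes.

    pred, target: flat sequences of integer class labels, one per pixel
    (assumed format). Pixels whose target equals ignore_index are skipped
    (assumed convention). Classes absent from both pred and target are
    left out of the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Pixel lies in both the predicted and true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel lies in the predicted region of p and the true region of t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes: per-class IoUs are 1/3, 2/3 and 1/2.
score = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(f"{score:.2f}")  # prints 0.50
```

In practice the intersection and union counts would be accumulated over every frame of the test set before dividing, so that the score reflects the whole dataset and not an average of per-frame values.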

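Using the VSPW dataset, this study designs a video scene parsing model based on a Temporal Bilateral Network and a Vision Transformer. The model uses convolutions and a Vision Transformer to obtain spatial and context information, and a temporal context module to capture inter-frame context. Experiments show that the model achieves an mIoU of 49.85% in the VSPW2021 Challenge.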