In this paper, we focus on the unsupervised video object segmentation (VOS) task, which learns visual correspondence from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm; they optimize at either the pixel level or the image level and show unsatisfactory