We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioned on the current frame, thereby simulating the expected motion of objects. This is achieved by explicitly encoding object presence, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including learning in an unsupervised manner, and addresses its shortcomings. We use a moving multi-MNIST dataset to show the limitations of AIR in detecting overlapping or partially occluded objects, and show how SQAIR overcomes them by leveraging the temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.

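This paper presents Sequential Attend, Infer, Repeat (SQAIR), an interpretable algorithm for tracking objects in video, built on a deep generative model. It reliably discovers and tracks objects in a video and can generate future video frames. The model's latent variables explicitly encode object presence, location and appearance. SQAIR retains all the advantages of AIR (Eslami et al., 2016), learns without supervision, and overcomes AIR's limitations in detecting overlapping or partially occluded objects by exploiting the temporal consistency of objects. It can also be applied to real-world pedestrian CCTV data to detect, track and generate pedestrians.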

Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects