A crucial ability of human intelligence is to build up models of individual 3D objects from partial scene observations. Recent works have enabled unsupervised 3D representation learning at scene-level, yet learning to decompose the 3D scene into 3D objects and build their individual models from multi-object scene images remains elusive. In this paper, we pro