Learning to generate natural scenes has always been a challenging task in
computer vision. It is even more painstaking when the generation is conditioned
on images with drastically different views. This is mainly because
understanding, corresponding, and transforming appearance and semantic
information across the views is not trivial. In this paper, we attem