In this paper, we introduce the task of automatically generating text to
describe the differences between two similar images. We collect a new dataset
by crowd-sourcing difference descriptions for pairs of image frames extracted
from video-surveillance footage. Annotators were asked to succinctly describe
all the differences in a short paragraph. As a result