This paper considers the task of matching images and sentences. The challenge consists in discriminatively embedding the two modalities onto a shared visual-textual space. Existing work in this field largely uses Recurrent Neural Networks (RNN) for text feature learning and employs off-the-shelf Convolutional Neural Networks (CNN) for image feature extractio