In this paper, we propose multimodal convolutional neural networks (m-CNNs)
for matching image and sentence. Our m-CNN provides an end-to-end framework
with convolutional architectures to exploit image representation, word
composition, and the matching relations between the two modalities. More
specifically, it consists of one image CNN encoding the image content, and one
matching CNN learning the joint representation of image and sentence. The
matching CNN composes words to different semantic fragments and learns the
inter-modal relations between image and the composed fragments at different
levels, thus fully exploit the matching relations between image and sentence.
Experimental results on benchmark databases of bidirectional image and sentence
retrieval demonstrate that the proposed m-CNNs can effectively capture the
information necessary for image and sentence matching. Specifically, our
proposed m-CNNs for bidirectional image and sentence retrieval on Flickr30K and
Microsoft COCO databases achieve the state-of-the-art performances.

本论文提出了多模态卷积神经网络 (m-CNNs)，用于匹配图像和句子。该网络结构采用卷积架构来利用图像表示、单词组合和两种模态之间的匹配关系。实验结果表明，我们的 m-CNNs 可以有效地捕捉图像和句子匹配所需的信息，并在 Flickr30K 和 Microsoft COCO 数据库的双向图像和句子检索上取得了最先进的性能。

多模态卷积神经网络：图像和文本匹配

Multimodal Convolutional Neural Networks for Matching Image and Sentence

Appearance-based gaze estimation is believed to work well in real-world
settings, but existing datasets have been collected under controlled laboratory
conditions and methods have been not evaluated across multiple datasets. In
this work we study appearance-based gaze estimation in the wild. We present the
MPIIGaze dataset that contains 213,659 images we collected from 15 participants
during natural everyday laptop use over more than three months. Our dataset is
significantly more variable than existing ones with respect to appearance and
illumination. We also present a method for in-the-wild appearance-based gaze
estimation using multimodal convolutional neural networks that significantly
outperforms state-of-the art methods in the most challenging cross-dataset
evaluation. We present an extensive evaluation of several state-of-the-art
image-based gaze estimation algorithms on three current datasets, including our
own. This evaluation provides clear insights and allows us to identify key
research challenges of gaze estimation in the wild.

本文针对外界复杂的实际应用场景下的视线估计问题，在自然、真实的使用环境中使用 MPIIGaze 数据集进行研究，并提出了一种多模态卷积神经网络的方法，通过跨数据集评估证明该方法显著优于现有的方法。我们还对三个最新数据集上的几种最先进的基于图像的凝视估计算法进行了细致评估，确认了外部环境变化对凝视估计的影响，为实际的实时视线估计研究提供了重要的参考。