In recent years, vision-language research has shifted toward tasks that
require more complex reasoning, such as interactive question answering, visual
common sense reasoning, and question-answer plausibility prediction. However,
the datasets used for these problems fail to capture the complexity of real
inputs and multimodal environments, such as ambiguous natural language requests
and diverse digital domains. We introduce Mobile app Tasks with Iterative
Feedback (MoTIF), a dataset with natural language commands for the greatest
number of interactive environments to date. MoTIF is the first to contain
natural language requests for interactive environments that are not
satisfiable, and we obtain follow-up questions on this subset to enable
research on task uncertainty resolution. We perform initial feasibility
classification experiments and reach an F1 score of only 37.3, underscoring
the need for richer vision-language representations and improved architectures
to reason about task feasibility.

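This work introduces the Mobile app Tasks with Iterative Feedback (MoTIF) dataset, which targets more complex tasks than prior datasets by pairing natural language commands with interactive environments. It also includes requests that cannot be satisfied, together with follow-up questions for resolving task uncertainty, and its results indicate that richer vision-language representations and improved architectures are needed to reason about task feasibility.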

Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments
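
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how an F1 score is computed for a binary feasibility classifier. The label encoding (1 for the class being scored, 0 otherwise) and the function name `f1_score` are assumptions made for illustration; the abstract does not state which class is treated as positive, and its 37.3 appears to be reported on a 0 to 100 scale.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical labels, not taken from MoTIF.
    truth = [1, 1, 1, 0, 0]
    preds = [1, 0, 0, 1, 0]
    print(f1_score(truth, preds))
```

With these hypothetical labels there is one true positive, one false positive, and two false negatives, so precision is 1/2, recall is 1/3, and the F1 score is 0.4 (40 on a 0 to 100 scale).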

Text-visual (also called semantic-visual) embedding is a central problem in
vision-language research. It typically involves mapping an image and a text
description to a common feature space through a CNN image encoder and an RNN
language encoder. In this paper, we propose a new method for learning
text-visual embedding using both image titles and click-through data from an
image search engine. We also propose a new triplet loss function by modeling
positive awareness of the embedding, and introduce a novel mini-batch-based
hard negative sampling approach for better data efficiency in the learning
process. Experimental results show that our proposed method outperforms
existing methods, and is also effective for real-world text-to-visual
retrieval.

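The paper proposes a new method for learning text-visual embeddings from image titles and click-through data from an image search engine. It introduces a new triplet loss function that models positive awareness of the embedding, along with a novel mini-batch-based hard negative sampling approach that improves data efficiency during learning. Experimental results show that the method outperforms existing methods and is also effective for real-world text-to-visual retrieval.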
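
To make the idea of mini-batch hard negative sampling concrete, here is a minimal NumPy sketch of a standard margin-based triplet loss that picks the hardest in-batch negative for each text query. It is not the paper's positive-aware loss, whose exact form the abstract does not give. The function names, the cosine similarity, and the margin value of 0.2 are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def hard_negative_triplet_loss(text_emb, image_emb, margin=0.2):
    """Triplet loss using the hardest in-batch negative for each text query.

    Row i of text_emb and row i of image_emb are assumed to be a matched
    (positive) pair; every other image in the batch is a candidate negative.
    """
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    v = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    sim = t @ v.T                     # sim[i, j] = cosine(text i, image j)
    pos = np.diag(sim)                # similarity of each matched pair
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)    # exclude the positive from the negatives
    hardest = neg.max(axis=1)         # most similar non-matching image per text
    return float(np.mean(np.maximum(0.0, margin - pos + hardest)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))    # hypothetical text encoder outputs
    image = rng.normal(size=(4, 8))   # hypothetical image encoder outputs
    print(hard_negative_triplet_loss(text, image))
```

Selecting only the most similar non-matching image concentrates the loss on the negatives the model currently confuses with the positive, which is the usual motivation for hard negative mining within a batch.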