Object search is challenging: given a complex language description (e.g., "find the white cup on the table"), the robot must move its camera through the environment and recognize the described object. Previous works map language descriptions to a set of fixed object detectors with predetermined noise models, but these approaches are difficult to scale because a new detector must be built for each object. In this work, we bridge the gap in realistic object search by posing the search problem as a partially observable Markov decision process (POMDP) in which the object detector and the visual sensor noise in the observation model are determined by a single deep neural network conditioned on complex language descriptions. We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise. With an LCOM, any language description of an object can be used to generate an appropriate object detector and noise model, and training an LCOM requires only readily available supervised image-caption datasets. We empirically evaluate our method by comparing against a state-of-the-art object search algorithm in simulation, and demonstrate that planning with our observation model yields a significantly higher average task completion rate (from 0.46 to 0.66) and faster, more efficient object search than planning with a fixed-noise model. We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.
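
To make the idea of dynamically changing sensor noise concrete, the following is a minimal sketch of a single Bayesian belief update over candidate object locations. It is not the LCOM described above: the function name `update_belief`, the grid-cell belief representation, and the linear mapping from a detector confidence score to true-positive and false-positive rates are all assumptions chosen for illustration. A fixed-noise model would correspond to using the same rates at every step regardless of the language description or image.

```python
import numpy as np


def update_belief(belief, in_view, detected, confidence):
    """One Bayesian belief update over candidate object locations.

    belief     : 1-D float array, P(object at cell i); sums to 1
    in_view    : 1-D bool array, True for cells inside the current field of view
    detected   : bool, whether the detector fired on the current image
    confidence : float in [0, 1]; hypothetical stand-in for a network output,
                 where higher confidence means lower assumed sensor noise
    """
    belief = np.asarray(belief, dtype=float)
    in_view = np.asarray(in_view, dtype=bool)

    # Assumed (illustrative) mapping from confidence to noise rates.
    # confidence = 0 gives 0.5 / 0.5, i.e. an uninformative observation.
    true_pos = 0.5 + 0.45 * confidence   # P(detect | object in view)
    false_pos = 0.5 - 0.45 * confidence  # P(detect | object not in view)

    # Likelihood of the observation under each hypothesis "object at cell i".
    p_detect = np.where(in_view, true_pos, false_pos)
    likelihood = p_detect if detected else 1.0 - p_detect

    # Bayes rule: posterior is proportional to likelihood times prior.
    posterior = likelihood * belief
    return posterior / posterior.sum()


if __name__ == "__main__":
    prior = np.full(4, 0.25)
    view = np.array([True, False, False, False])
    print(update_belief(prior, view, detected=True, confidence=1.0))
    print(update_belief(prior, view, detected=True, confidence=0.0))
```

In this sketch a confident detection concentrates the belief on the cells in view, while a zero-confidence detection leaves the prior unchanged, which is the qualitative behavior a planner can exploit when the noise model varies with the description.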

Language-Conditioned Observation Models for Visual Object Search