To operate at a building scale, service robots must perform very long-horizon mobile manipulation tasks by navigating to different rooms, accessing different floors, and interacting with a wide and unseen range of everyday objects. We refer to these tasks as Building-wide Mobile Manipulation. To tackle these inherently long-horizon tasks, we introduce BUMBLE, a unified Vision-Language Model (VLM)-based framework integrating open-world RGBD perception, a wide spectrum of gross-to-fine motor skills, and dual-layered memory. Our extensive evaluation (90+ hours) indicates that BUMBLE outperforms multiple baselines in long-horizon building-wide tasks that require sequencing up to 12 ground truth skills spanning 15 minutes per trial. BUMBLE achieves a 47.1% success rate averaged over 70 trials in different buildings, tasks, and scene layouts from different starting rooms and floors. Our user study demonstrates 22% higher satisfaction with our method than with state-of-the-art mobile manipulation methods. Finally, we demonstrate the potential of using increasingly capable foundation models to push performance further. For more information, see https://robin-lab.cs.utexas.edu/BUMBLE/

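This work addresses the long-horizon challenges of building-wide mobile manipulation by proposing the BUMBLE framework, which integrates open-world RGBD perception, diverse motor skills, and dual-layered memory to execute tasks efficiently. The evaluation shows that BUMBLE achieves a 47.1% success rate across different buildings and task scenarios, and that user satisfaction is 22% higher than with existing methods, demonstrating the potential of using advanced foundation models to improve performance.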

BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation