We introduce Lumos, the first end-to-end multimodal question-answering system
with text understanding capabilities. At the core of Lumos is a Scene Text
Recognition (STR) component that extracts text from first-person point-of-view
images, the output of which is used to augment input to a Multimodal Large
Language Model (MM-LLM). While building Lumos, we encountered numerous
challenges related to STR quality, overall latency, and model inference. In
this paper, we examine these challenges and discuss the system
architecture, design choices, and modeling techniques employed to overcome
them. We also provide a comprehensive evaluation of each component,
showcasing high quality and efficiency.
