BriefGPT.xyz
Apr, 2023
Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA
Yongxin Zhu, Zhen Liu, Yukang Liang, Xin Li, Hao Liu...
TL;DR
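Proposes a multi-modal framework for scene text visual question answering that follows a "locate then generate" paradigm, using spatial bounding boxes as a bridge between the text and visual modalities, and improves absolute accuracy with a pre-trained language model.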
Abstract
In this paper, we propose a novel multi-modal framework for scene text visual question answering (STVQA), which requires models to read scene text in images for question answering. Apart from text or visual objec