We introduce AutoVER, an Autoregressive model for Visual Entity Recognition.
Our model extends an autoregressive Multi-modal Large Language Model by
employing retrieval-augmented constrained generation. It mitigates low
performance on out-of-domain entities while excelling in queries that require
visually-situated reasoning. Our method learns to distinguish similar entities
within a vast label space by training contrastively on hard negative pairs in
parallel with a sequence-to-sequence objective, without an external retriever.
During inference, a list of retrieved candidate answers explicitly guides
language generation by removing invalid decoding paths. The proposed method
achieves significant improvements across different dataset splits in the
recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises
from 32.7% to 61.5%. The method also achieves superior performance on the
unseen and query splits, by a substantial double-digit margin.
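
To illustrate how a list of retrieved candidates can remove invalid decoding
paths, the following is a minimal sketch of prefix-trie constrained greedy
decoding. It is not the AutoVER implementation: the names build_trie,
allowed_next, constrained_greedy_decode, the EOS marker, and the toy scoring
function are all hypothetical, and a real system would score tokens with the
language model's logits over its own vocabulary.

```python
from typing import Callable, Dict, List, Sequence

EOS = "</s>"  # hypothetical end-of-sequence marker, an assumption of this sketch

Trie = Dict[str, dict]


def build_trie(candidates: Sequence[Sequence[str]]) -> Trie:
    """Build a prefix trie over the tokenized candidate entity names."""
    root: Trie = {}
    for tokens in candidates:
        node = root
        for tok in list(tokens) + [EOS]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: Trie, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that keep the prefix on a valid candidate path."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # the prefix already left every candidate path
    return list(node.keys())


def constrained_greedy_decode(
    trie: Trie,
    score: Callable[[Sequence[str], str], float],
    max_len: int = 16,
) -> List[str]:
    """Greedy decoding in which only trie-permitted tokens are considered."""
    prefix: List[str] = []
    for _ in range(max_len):
        options = allowed_next(trie, prefix)
        if not options:
            break
        # Tokens outside `options` are never scored, so they cannot be chosen.
        best = max(options, key=lambda tok: score(prefix, tok))
        if best == EOS:
            break
        prefix.append(best)
    return prefix


# Toy usage: three retrieved candidates and a stand-in scoring function.
candidates = [
    ["golden", "retriever"],
    ["golden", "gate", "bridge"],
    ["labrador", "retriever"],
]
toy_scores = {"golden": 0.9, "labrador": 0.5, "gate": 0.2,
              "retriever": 0.8, "bridge": 0.7, EOS: 0.1, "poodle": 5.0}


def toy_score(prefix: Sequence[str], tok: str) -> float:
    # Stand-in for model logits; ignores the prefix for simplicity.
    return toy_scores.get(tok, 0.0)


trie = build_trie(candidates)
result = constrained_greedy_decode(trie, toy_score)
print(result)
```

In this toy setting "poodle" has the highest raw score but can never be
emitted, because it lies on no candidate path. The decoder can only return a
sequence that follows one of the retrieved candidates.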
