Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model
applied across a variety of open-domain tasks, such as question answering and
fact checking. In FiD, supporting passages are first retrieved and then
processed using a generative model (Reader), which can cause a significant
bottleneck in decoding time, particularly with long outputs. In this work, we
analyze the contribution and necessity of all the retrieved passages to the
performance of reader models, and propose eliminating, at the token level,
retrieved information that might not be essential to the answer generation
process. We demonstrate that our method can reduce run time by up to 62.2%
with only a 2% reduction in performance and, in some cases, even improve
performance.

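By analyzing how much the retrieved passages contribute to reader-model performance and whether they are necessary, and by eliminating at the token level some retrieved information that may not contribute to answer generation, we show that our method can cut run time by up to 62.2% with only a 2% drop in performance, and in some cases even improve performance.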

Optimizing Retrieval-augmented Reader Models via Token Elimination
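
To make the idea of token-level elimination concrete, here is a minimal sketch. The abstract does not say how tokens are scored, so the per-token importance scores below (for instance, aggregated cross-attention weights) are an assumption, and `eliminate_tokens` and `keep_fraction` are hypothetical names, not the paper's.

```python
import math
from typing import List, Tuple


def eliminate_tokens(scores: List[List[float]], keep_fraction: float) -> List[List[int]]:
    """Return, per passage, the indices of tokens that survive elimination.

    scores[p][t] is an assumed importance score for token t of passage p.
    How such a score is obtained is NOT specified in the abstract.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Flatten to (score, passage, token) so tokens from all passages compete
    # for one shared budget.
    flat: List[Tuple[float, int, int]] = [
        (s, p, t) for p, row in enumerate(scores) for t, s in enumerate(row)
    ]
    budget = max(1, math.ceil(keep_fraction * len(flat)))

    # Highest score first; ties are broken by position for determinism.
    flat.sort(key=lambda item: (-item[0], item[1], item[2]))

    kept: List[List[int]] = [[] for _ in scores]
    for _, p, t in flat[:budget]:
        kept[p].append(t)
    # Restore the original token order inside each passage.
    return [sorted(indices) for indices in kept]


# Two toy passages with three tokens each; keep half of all tokens.
scores = [[0.9, 0.1, 0.4], [0.05, 0.8, 0.3]]
print(eliminate_tokens(scores, keep_fraction=0.5))  # [[0, 2], [1]]
```

With fewer encoder tokens left, the decoder attends over a shorter context at every generation step, which is where a run-time saving on long outputs would come from.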

A Retrieval-Augmented Language Model (RALM) augments a generative language
model by retrieving context-specific knowledge from an external database. This
strategy facilitates impressive text generation quality even with smaller
models, thus reducing computational demands by orders of magnitude. However,
RALMs introduce unique system design challenges due to (a) the diverse workload
characteristics between LM inference and retrieval and (b) the various system
requirements and bottlenecks for different RALM configurations such as model
sizes, database sizes, and retrieval frequencies. We propose Chameleon, a
heterogeneous accelerator system that integrates both LM and retrieval
accelerators in a disaggregated architecture. The heterogeneity ensures
efficient acceleration of both LM inference and retrieval, while the
accelerator disaggregation enables the system to independently scale both types
of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype
implements retrieval accelerators on FPGAs and assigns LM inference to GPUs,
with a CPU server orchestrating these accelerators over the network. Compared
to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x
speedup and 26.2x higher energy efficiency. Evaluated on various RALMs, Chameleon
exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput
compared to the hybrid CPU-GPU architecture. These promising results pave the
way for bringing accelerator heterogeneity and disaggregation into future RALM
systems.

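Chameleon is a heterogeneous accelerator system that integrates language-model and retrieval accelerators in a disaggregated architecture, efficiently accelerating the differing requirements of Retrieval-Augmented Language Model systems and delivering significant performance gains.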

Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
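
The benefit of scaling the two accelerator types independently can be illustrated with a small capacity-planning sketch. It is not Chameleon's scheduler: `Workload`, `size_pools`, and all throughput figures are hypothetical and chosen only to show how retrieval frequency changes the ratio of retrieval to LM accelerators.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Workload:
    tokens_per_second: float  # target generation throughput
    retrieval_interval: int   # generated tokens per retrieval (1 = every token)


def size_pools(
    workload: Workload,
    lm_tokens_per_second_per_unit: float,
    queries_per_second_per_unit: float,
) -> Tuple[int, int]:
    """Return (lm_units, retrieval_units) needed to sustain the workload.

    Each pool is sized on its own, which is what a disaggregated design allows.
    Per-unit throughputs are assumed inputs, not measured values.
    """
    lm_units = max(1, math.ceil(workload.tokens_per_second / lm_tokens_per_second_per_unit))
    queries_per_second = workload.tokens_per_second / workload.retrieval_interval
    retrieval_units = max(1, math.ceil(queries_per_second / queries_per_second_per_unit))
    return lm_units, retrieval_units


# Same LM throughput target, very different retrieval frequencies.
frequent = Workload(tokens_per_second=1000, retrieval_interval=1)
rare = Workload(tokens_per_second=1000, retrieval_interval=64)
print(size_pools(frequent, 250, 400))  # (4, 3)
print(size_pools(rare, 250, 400))      # (4, 1)
```

In a tightly coupled design the ratio of the two accelerator types is fixed, so one of the two configurations above would leave hardware idle.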

Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that
sets the state of the art on many knowledge-intensive NLP tasks. However, FiD
suffers from very expensive inference. We show that the majority of inference
time results from memory bandwidth constraints in the decoder, and propose two
simple changes to the FiD architecture to speed up inference by 7x. The faster
decoder inference then allows for a much larger decoder. We denote FiD with the
above modifications as FiDO, and show that it strongly improves performance
over existing FiD models for a wide range of inference budgets. For example,
FiDO-Large-XXL performs faster inference than FiD-Base and achieves better
performance than FiD-Large.

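Two simple changes to the FiD architecture speed up inference and allow a much larger decoder. We call FiD with these modifications FiDO and show that it delivers better performance across a wide range of inference budgets.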
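
A back-of-envelope roofline estimate suggests why decoder cross-attention over many retrieved passages tends to be limited by memory bandwidth. This is a generic sketch, not the paper's analysis: the model dimensions and hardware figures are hypothetical, and it does not model the two architectural changes, which the abstract does not name.

```python
from typing import Tuple


def cross_attention_cost(
    n_context_tokens: int, d_model: int, n_layers: int, bytes_per_value: int = 2
) -> Tuple[int, int]:
    """Return (flops, bytes_moved) for one decoding step of cross-attention.

    Assumes one query token per step and cached keys and values for every
    context token in every decoder layer.
    """
    # Read one key and one value vector per context token per layer.
    bytes_moved = 2 * n_layers * n_context_tokens * d_model * bytes_per_value
    # Q.K^T and the weighted sum over V: one multiply-add (2 FLOPs) per cached value.
    flops = 4 * n_layers * n_context_tokens * d_model
    return flops, bytes_moved


def step_time_seconds(flops: int, bytes_moved: int, peak_flops: float, bandwidth: float) -> float:
    """Roofline estimate: the slower of compute and memory traffic dominates."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


# Hypothetical setup: 100 passages of 256 tokens, 24 layers, width 1024, 16-bit values.
flops, bytes_moved = cross_attention_cost(100 * 256, 1024, 24)
intensity = flops / bytes_moved  # FLOPs per byte moved

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
compute_time = flops / 100e12
memory_time = bytes_moved / 1e12
print(intensity, memory_time > compute_time)  # 1.0 True
```

At about one FLOP per byte, the step sits far below the roughly 100 FLOPs per byte this hypothetical accelerator would need to be compute-bound, so under these assumptions memory traffic sets the step time.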