In the traditional RAG framework, the basic retrieval units are normally short.
Common retrievers such as DPR normally work with 100-word Wikipedia
paragraphs. Such a design forces the retriever to search over a large corpus to
find the `needle' unit. In contrast, the readers only need to extract answers
from the short retrieved units. Such an imbalanced design, with a `heavy'
retriever and a `light' reader, can lead to sub-optimal performance. To
alleviate the imbalance, we propose a new framework, LongRAG, consisting of a
`long retriever' and a `long reader'. LongRAG processes the entire Wikipedia
into 4K-token units, which are 30x longer than before. By increasing the unit
size, we reduce the total number of units from 22M to 700K. This
significantly lowers the burden on the retriever, which leads to a remarkable
retrieval score: answer recall@1=71% on NQ (previously 52%) and answer
recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k
retrieved units ($\approx$ 30K tokens) to an existing long-context LLM to
perform zero-shot answer extraction. Without requiring any training, LongRAG
achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also
achieves 64.3% on HotpotQA (full-wiki), which is on par with the SoTA model. Our
study offers insights into the future roadmap for combining RAG with
long-context LLMs.
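
The answer recall@k figures quoted above can be read as the fraction of questions for which at least one gold answer string appears in the top-k retrieved units. The following is a minimal sketch under that assumption; the function names, the normalization rules, and the toy data in the usage note are illustrative and are not taken from the LongRAG code.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_recall_at_k(retrieved, gold_answers, k):
    """Fraction of questions whose top-k units contain any gold answer.

    retrieved:    one ranked list of unit texts per question (assumed format).
    gold_answers: one list of acceptable answer strings per question.
    """
    if not retrieved:
        return 0.0
    hits = 0
    for units, answers in zip(retrieved, gold_answers):
        # Concatenate the top-k units and test for substring containment.
        context = normalize(" ".join(units[:k]))
        normalized = [normalize(a) for a in answers]
        # Skip answers that normalize to the empty string (e.g. "the").
        if any(a and a in context for a in normalized):
            hits += 1
    return hits / len(retrieved)
```

With two toy questions, one answered by its first-ranked unit and one only by its second-ranked unit, `answer_recall_at_k(retrieved, gold, 1)` gives 0.5 and `answer_recall_at_k(retrieved, gold, 2)` gives 1.0. Longer units make a hit at small k more likely, which is consistent with the recall gains the abstract reports.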

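In the traditional RAG framework, retrieval units are usually short. The proposed LongRAG framework instead processes the entire Wikipedia into 4K-token units; by increasing the unit size and reducing the total number of units, it lowers the burden on the retriever and achieves the best results without any training, offering insights for the future development of combining RAG with long-context language models.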

LongRAG: Enhancing Retrieval-Augmented Generation with Long-Context Language Models

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Leveraging Large Language Models (LLMs) as judges for evaluating the
performance of LLMs has recently garnered attention. Nonetheless, this type of
approach concurrently introduces potential biases from LLMs, raising concerns
about the reliability of the evaluation results. To mitigate this issue, we
propose and study two versions of many-shot in-context prompts, Reinforced and
Unsupervised ICL, for helping GPT-4o-as-a-Judge in single answer grading. Based
on the designed prompts, we investigate the impact of scaling the number of
in-context examples on the agreement and quality of the evaluation.
Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise
comparison and then propose a simple yet effective approach to mitigate it.
Experimental results show that advanced long-context LLMs, such as GPT-4o,
perform better in the many-shot regime than in the zero-shot regime. Meanwhile,
the experimental results further verify the effectiveness of the symbol bias
mitigation approach.

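Using large language models as judges to evaluate the performance of large language models may introduce potential biases and raises concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts (Reinforced and Unsupervised) to help GPT-4o-as-a-Judge in single answer grading. Based on the designed prompts, we investigate the impact of increasing the number of in-context examples on the agreement and quality of the evaluation. In addition, we reveal for the first time the symbol bias of GPT-4o-as-a-Judge in pairwise comparison and propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context language models, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. The experimental results also further verify the effectiveness of the symbol bias mitigation approach.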

Can Many-Shot In-Context Learning in Long Contexts Help LLM Judges? See More, Judge Better!

Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

Research on Large Language Models (LLMs) has recently witnessed an increasing
interest in extending models' context size to better capture dependencies
within long documents. While benchmarks have been proposed to assess long-range
abilities, existing efforts primarily considered generic tasks that are not
necessarily aligned with real-world applications. In contrast, our work
proposes a new benchmark for long-context LLMs focused on a practical meeting
assistant scenario. In this scenario, the long contexts consist of transcripts
obtained by automatic speech recognition, presenting unique challenges for LLMs
due to the inherent noisiness and oral nature of such data. Our benchmark,
named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271
manually crafted questions and their ground-truth answers. Our experiments with
recent long-context LLMs on ELITR-Bench highlight a gap between open-source and
proprietary models, especially when questions are asked sequentially within a
conversation. We also provide a thorough analysis of our GPT-4-based evaluation
method, encompassing insights from a crowdsourcing study. Our findings suggest
that while GPT-4's evaluation scores are correlated with human judges', its
ability to differentiate among more than three score levels may be limited.

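Our work proposes a new benchmark for long-context large language models, named ELITR-Bench, focused on a practical meeting assistant scenario. We augment the transcripts of the existing ELITR corpus with 271 manually crafted questions and their ground-truth answers. Experiments show a gap between current open-source and proprietary models on ELITR-Bench, especially when questions are asked sequentially within a conversation. We also provide a detailed analysis of our GPT-4-based evaluation method, including insights from a crowdsourcing study, and find that GPT-4's evaluation scores correlate well with human judgments, but its ability to distinguish among more than three score levels may be limited.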