Although Large Language Models (LLMs) have demonstrated strong performance on
a wide range of tasks, they still face reliability challenges such as
hallucination. Previous studies reveal that highly capable LLMs like GPT-4 are
effective in judging the reliability of individual responses, while less
capable ones are often tuned to evaluate the relative reliability of responses
to the same query. To enable less capable LLMs to effectively judge the
reliability of individual responses, we propose a novel method named
$\textit{Meta}$ $\textit{Ranking}$ (MR). Unlike previous methods, which assess
the response directly, MR reaches its judgement by comparing the target
query-response pair with reference query-response pairs. We find MR
highly effective at detecting errors in LLM responses on reasoning
tasks, where less capable LLMs can outperform strong baselines, even without
fine-tuning. We further demonstrate that MR can be used to enhance the
performance of LLMs in two practical applications: query routing and iterative
training data filtering. The former achieves performance comparable to
GPT-4-turbo with less than half the token consumption, while the latter enables
instruction-tuned LLaMA-7B and Phi-2, a 2.7B model, to significantly surpass
Alpaca-13B with fewer training samples, underscoring the high potential of our
proposed method.
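
To make the comparison-based judgement and the query-routing application concrete, the following is a minimal sketch under stated assumptions, not the exact procedure used in this work. The names `meta_rank`, `route`, `compare`, `weak_llm` and `strong_llm` are hypothetical, the comparator stands in for a prompt to a less capable LLM, and the simple vote over labelled references is an assumed aggregation rule chosen only for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]          # (query, response)
Reference = Tuple[Pair, bool]   # (pair, is the reference response known to be correct?)
# Hypothetical comparator: +1 if the target pair looks more reliable than the
# reference pair, -1 if less reliable, 0 if they look equally reliable.
Comparator = Callable[[Pair, Pair], int]


def meta_rank(target: Pair, references: List[Reference], compare: Comparator) -> bool:
    """Judge one response by comparing it against labelled reference pairs.

    Assumed aggregation (illustrative only): each reference casts one vote.
    Matching or beating a correct reference votes "reliable"; only strictly
    beating an incorrect reference votes "reliable".
    """
    score = 0
    for ref_pair, ref_is_correct in references:
        outcome = compare(target, ref_pair)
        if ref_is_correct:
            score += 1 if outcome >= 0 else -1
        else:
            score += 1 if outcome > 0 else -1
    return score > 0


def route(query: str,
          weak_llm: Callable[[str], str],
          strong_llm: Callable[[str], str],
          references: List[Reference],
          compare: Comparator) -> Tuple[str, str]:
    """Assumed routing rule: keep the weak model's answer if MR accepts it,
    otherwise fall back to the stronger (more expensive) model."""
    response = weak_llm(query)
    if meta_rank((query, response), references, compare):
        return response, "weak"
    return strong_llm(query), "strong"


# Toy stand-ins so the sketch runs without any model: a fixed quality table
# replaces the LLM-based comparison.
QUALITY = {"good": 2, "bad": 0, "great": 3, "awful": -1}


def toy_compare(target: Pair, reference: Pair) -> int:
    diff = QUALITY[target[1]] - QUALITY[reference[1]]
    return (diff > 0) - (diff < 0)


REFERENCES: List[Reference] = [(("q1", "good"), True), (("q2", "bad"), False)]
```

In this sketch, token savings would come from calling the stronger model only when the weaker model's response is rejected; the same accept/reject signal could, under the same assumptions, be used to drop unreliable samples when filtering training data.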

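We propose a new method named Meta Ranking (MR) that compares the target query-response pair with reference query-response pairs, enabling less capable large language models to effectively judge the reliability of individual responses. MR achieves strong error detection on reasoning tasks and can be used to improve the performance of large language models in practical applications such as query routing and iterative training data filtering.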