Ensuring that large language models (LLMs) reflect diverse user values and
preferences is crucial as their user bases expand globally. It is therefore
encouraging to see the growing interest in LLM personalization within the
research community. However, current work often relies on the LLM-as-a-Judge
approach for evaluation without thoroughly examining its validity. In this
paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking
LLMs to judge user preferences based on personas. Our findings suggest that
directly applying LLM-as-a-Personalized-Judge is less reliable than previously
assumed, showing low and inconsistent agreement with human ground truth. The
personas typically used are often overly simplistic, resulting in low
predictive power. To address these issues, we introduce verbal uncertainty
estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to
express low confidence on uncertain judgments. This adjustment leads to much
higher agreement (above 80%) on high-certainty samples for binary tasks.
Through human evaluation, we find that the LLM-as-a-Personalized-Judge achieves
comparable performance to third-party human evaluation and even surpasses
human performance on high-certainty samples. Our work indicates that
certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for
developing more reliable and scalable methods for evaluating LLM
personalization.
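
The certainty-filtered agreement described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Judgment` record, the 1 to 5 certainty scale, the threshold, and the toy data are all assumptions made for illustration. It shows how agreement with human ground truth can be computed on all samples and on only those the judge marked as high certainty, together with the share of samples retained.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    predicted: str   # judge's choice on a binary task, e.g. "A" or "B"
    certainty: int   # verbalized certainty; hypothetical scale, 1 (low) to 5 (high)
    human: str       # ground-truth preference reported by the user


def agreement(judgments):
    """Fraction of judgments that match the human label (None if empty)."""
    if not judgments:
        return None
    return sum(j.predicted == j.human for j in judgments) / len(judgments)


def selective_agreement(judgments, min_certainty):
    """Agreement on samples at or above a certainty threshold, plus coverage."""
    kept = [j for j in judgments if j.certainty >= min_certainty]
    coverage = len(kept) / len(judgments) if judgments else 0.0
    return agreement(kept), coverage


# Toy data, invented for illustration only (not results from the paper).
sample = [
    Judgment("A", 5, "A"),
    Judgment("B", 4, "B"),
    Judgment("A", 2, "B"),
    Judgment("B", 1, "A"),
    Judgment("A", 4, "B"),
]

overall = agreement(sample)
high_conf, coverage = selective_agreement(sample, min_certainty=4)
print(f"overall={overall:.2f} high-certainty={high_conf:.2f} coverage={coverage:.2f}")
```

Reporting coverage alongside agreement matters here, because a stricter threshold raises agreement only by discarding more samples.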

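Certainty-enhanced LLM-as-a-Personalized-Judge offers a more reliable and scalable method for evaluating LLM personalization, performing on par with human judgment and even surpassing humans on high-certainty samples.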

Can LLM be a Personalized Judge?

Considerable efforts to measure and mitigate gender bias in recent years have
led to the introduction of an abundance of tasks, datasets, and metrics used in
this vein. In this position paper, we assess the current paradigm of gender
bias evaluation and identify several flaws in it. First, we highlight the
importance of extrinsic bias metrics that measure how a model's performance on
some task is affected by gender, as opposed to intrinsic evaluations of model
representations, which are less strongly connected to specific harms to people
interacting with systems. We find that only a few extrinsic metrics are
measured in most studies, although more can be measured. Second, we find that
datasets and metrics are often coupled, and discuss how their coupling hinders
the ability to obtain reliable conclusions, and how one may decouple them. We
then investigate how the choice of the dataset and its composition, as well as
the choice of the metric, affect bias measurement, finding significant
variations across each of them. Finally, we propose several guidelines for more
reliable gender bias evaluation.
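
One simple instance of an extrinsic metric, in the sense used above, is the gap in task performance between gender groups. The sketch below is an assumption-laden illustration, not a metric defined in the paper: the function names, the `(group, gold, predicted)` tuple layout, and the toy data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(examples):
    """Per-group accuracy from (group, gold_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, predicted in examples:
        total[group] += 1
        correct[group] += int(gold == predicted)
    return {group: correct[group] / total[group] for group in total}


def performance_gap(examples, group_a, group_b):
    """Accuracy of group_a minus accuracy of group_b; 0 means no measured gap."""
    acc = accuracy_by_group(examples)
    return acc[group_a] - acc[group_b]


# Toy predictions, invented for illustration only.
examples = [
    ("female", "doctor", "doctor"),
    ("female", "doctor", "nurse"),
    ("female", "engineer", "teacher"),
    ("female", "teacher", "teacher"),
    ("male", "doctor", "doctor"),
    ("male", "nurse", "doctor"),
    ("male", "engineer", "engineer"),
    ("male", "teacher", "teacher"),
]

print(accuracy_by_group(examples))
print(performance_gap(examples, "male", "female"))
```

Because the gap depends on which examples are in the dataset, the same metric can give different answers on differently composed data, which is the coupling issue the abstract raises.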

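By assessing the current paradigm of gender bias evaluation and identifying several of its flaws, we propose guidelines for more reliable gender bias evaluation, emphasize the importance of extrinsic bias metrics that measure how gender affects model performance, and find that datasets and metrics are often coupled, which is one of the factors that hinder reliable conclusions.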

Choose Your Lenses: Flaws in Gender Bias Evaluation

Deep reinforcement learning (RL) algorithms are predominantly evaluated by
comparing their relative performance on a large suite of tasks. Most published
results on deep RL benchmarks compare point estimates of aggregate performance
such as mean and median scores across tasks, ignoring the statistical
uncertainty implied by the use of a finite number of training runs. Beginning
with the Arcade Learning Environment (ALE), the shift towards
computationally-demanding benchmarks has led to the practice of evaluating only
a small number of runs per task, exacerbating the statistical uncertainty in
point estimates. In this paper, we argue that reliable evaluation in the
few-run deep RL regime cannot ignore the uncertainty in results without running the
risk of slowing down progress in the field. We illustrate this point using a
case study on the Atari 100k benchmark, where we find substantial discrepancies
between conclusions drawn from point estimates alone versus a more thorough
statistical analysis. With the aim of increasing the field's confidence in
reported results with a handful of runs, we advocate for reporting interval
estimates of aggregate performance and propose performance profiles to account
for the variability in results, as well as present more robust and efficient
aggregate metrics, such as interquartile mean scores, to achieve small
uncertainty in results. Using such statistical tools, we scrutinize performance
evaluations of existing algorithms on other widely used RL benchmarks including
the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies
in prior comparisons. Our findings call for a change in how we evaluate
performance in deep RL, for which we present a more rigorous evaluation
methodology, accompanied by an open-source library, rliable, to prevent
unreliable results from stagnating the field.
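
To make the interquartile mean and interval estimates mentioned above concrete, here is a minimal standard-library sketch. It does not use or reproduce the rliable API. It assumes a flat list of normalized scores and a plain percentile bootstrap, which is a simplification and not necessarily the resampling scheme the paper uses; the function names and toy scores are hypothetical.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    cut = len(ordered) // 4          # number of values dropped from each tail
    middle = ordered[cut:len(ordered) - cut]
    return statistics.fmean(middle)


def bootstrap_ci(scores, stat=iqm, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat`, resampling with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = sorted(
        stat([rng.choice(scores) for _ in range(n)])
        for _ in range(n_resamples)
    )
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Toy normalized scores with one outlier run, invented for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 10.0]

print("mean  :", statistics.fmean(scores))
print("median:", statistics.median(scores))
print("IQM   :", iqm(scores))
print("95% CI for IQM:", bootstrap_ci(scores))
```

On this toy data the single outlier dominates the mean, while the IQM discards the extreme quarters and stays close to the bulk of the runs. Reporting the interval alongside the point estimate shows how much a handful of runs can actually support.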

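Through a case study on the Atari 100k benchmark, this paper argues that evaluating deep RL algorithms from only a few training runs cannot ignore statistical uncertainty if results are to be reliable and progress in the field is not to stall. It proposes interval estimates for evaluating RL algorithm performance, re-examines existing algorithms on widely used benchmarks, and presents a more rigorous evaluation methodology, accompanied by the open-source library rliable.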