The factuality of large language models (LLMs) tends to decay over time, since
events that occur after their training are "unknown" to them. One way to keep
models up to date could be factual update: the task of inserting, replacing, or
removing certain simple (atomic) facts within the model. To study this task, we
present WikiFactDiff, a dataset that describes the evolution of factual
knowledge between two dates as a collection of simple facts divided into three
categories: new, obsolete, and static. We describe several update scenarios
arising from various combinations of these three types of basic update. The
facts are represented by subject-relation-object triples; WikiFactDiff
was constructed by comparing the state of the Wikidata knowledge base on 4
January 2021 and on 27 February 2023. These facts are accompanied by verbalization
templates and cloze tests that enable running update algorithms and their
evaluation metrics. Contrary to other datasets, such as zsRE and CounterFact,
WikiFactDiff constitutes a realistic update setting that involves various
update scenarios, including replacements, archival, and new entity insertions.
We also present an evaluation of existing update algorithms on WikiFactDiff.
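
The snapshot comparison described above can be illustrated with a minimal sketch. It assumes each snapshot is simply a set of subject-relation-object triples; the names `Triple`, `diff_snapshots`, and `update_scenarios`, the scenario labels, and the toy facts are all hypothetical illustrations, not the dataset's actual construction pipeline or schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """A simple (atomic) fact. Field names are illustrative assumptions."""
    subject: str
    relation: str
    obj: str


def diff_snapshots(old: set, new: set) -> dict:
    """Split facts from two snapshots into new, obsolete, and static."""
    return {
        "new": new - old,        # present only in the later snapshot
        "obsolete": old - new,   # present only in the earlier snapshot
        "static": old & new,     # unchanged between the two dates
    }


def update_scenarios(diff: dict) -> dict:
    """Label each (subject, relation) pair with a coarse, assumed scenario."""
    kinds = defaultdict(set)
    for kind in ("new", "obsolete"):
        for fact in diff[kind]:
            kinds[(fact.subject, fact.relation)].add(kind)
    labels = {}
    for key, seen in kinds.items():
        if seen == {"new", "obsolete"}:
            labels[key] = "replacement"
        elif seen == {"obsolete"}:
            labels[key] = "archival"
        else:
            labels[key] = "insertion"
    return labels


# Toy snapshots for illustration only (not drawn from the dataset).
old_snapshot = {
    Triple("France", "head of government", "Jean Castex"),
    Triple("France", "capital", "Paris"),
}
new_snapshot = {
    Triple("France", "head of government", "Élisabeth Borne"),
    Triple("France", "capital", "Paris"),
}

diff = diff_snapshots(old_snapshot, new_snapshot)
scenarios = update_scenarios(diff)
```

Here the "head of government" fact appears as both obsolete and new, so this sketch labels it a replacement, while the "capital" fact is static and needs no update.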

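The factuality of large language models decays over time; factual update is one way to keep models current. WikiFactDiff is a dataset that describes the evolution of factual knowledge, covering several update scenarios and an evaluation of existing update algorithms.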

WikiFactDiff: A Large, Realistic, and Temporally Adaptable Dataset for Atomic Factual Knowledge Update in Causal Language Models

A major challenge to deploying robots widely is navigation in human-populated
environments, commonly referred to as social robot navigation. While the field
of social navigation has advanced tremendously in recent years, the fair
evaluation of algorithms that tackle social navigation remains hard because it
involves not just robotic agents moving in static environments but also dynamic
human agents and their perceptions of the appropriateness of robot behavior. In
contrast, clear, repeatable, and accessible benchmarks have accelerated
progress in fields like computer vision, natural language processing and
traditional robot navigation by enabling researchers to fairly compare
algorithms, revealing limitations of existing solutions and illuminating
promising new directions. We believe the same approach can benefit social
navigation. In this paper, we pave the road towards common, widely accessible,
and repeatable benchmarking criteria to evaluate social robot navigation. Our
contributions include (a) a definition of a socially navigating robot as one
that respects the principles of safety, comfort, legibility, politeness, social
competency, agent understanding, proactivity, and responsiveness to context,
(b) guidelines for the use of metrics, development of scenarios, benchmarks,
datasets, and simulators to evaluate social navigation, and (c) a design of a
social navigation metrics framework to make it easier to compare results from
different simulators, robots and datasets.

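This paper proposes criteria, metrics, and scenario-based guidelines for evaluating social robot navigation algorithms, and designs a social navigation metrics framework for comparing results across different simulators, robots, and datasets.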

Principles and Guidelines for Evaluating Social Robot Navigation Algorithms

Many web systems rank and present a list of items to users, from recommender
systems to search and advertising. An important problem in practice is to
evaluate new ranking policies offline and optimize them before they are
deployed. We address this problem by proposing evaluation algorithms for
estimating the expected number of clicks on ranked lists from historical logged
data. The existing algorithms are not guaranteed to be statistically efficient
in our problem because the number of recommended lists can grow exponentially
with their length. To overcome this challenge, we use models of user
interaction with the list of items, the so-called click models, to construct
estimators that learn statistically efficiently. We analyze our estimators and
prove that they are more efficient than the estimators that do not use the
structure of the click model, under the assumption that the click model holds.
We evaluate our estimators in a series of experiments on a real-world dataset
and show that they consistently outperform prior estimators.
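
To make the idea of a click-model-based estimator concrete, here is a minimal sketch under a strong, explicitly assumed click model: a click at a position depends only on the item shown and that position. Under this assumption, importance weights can be computed per item-position pair rather than per whole list. The function names, the context-free setup, and the toy data are hypothetical, and this is not claimed to be any of the exact estimators analyzed in the paper.

```python
from collections import defaultdict


def marginal_propensities(logging_policy):
    """Probability that the logging policy shows `item` at position `pos`.

    logging_policy: dict mapping a ranked list (tuple of items) to its
    probability of being shown. This context-free form is an assumption.
    """
    props = defaultdict(float)
    for ranked_list, prob in logging_policy.items():
        for pos, item in enumerate(ranked_list):
            props[(item, pos)] += prob
    return dict(props)


def item_position_estimate(logs, target_list, propensities):
    """Estimate the expected clicks on `target_list` from logged data.

    logs: list of (logged_list, clicks) pairs, where clicks[pos] is 0 or 1.
    A logged click is reused whenever the logged item at a position matches
    the target item at that position, reweighted by its marginal propensity.
    """
    total = 0.0
    for logged_list, clicks in logs:
        for pos, (item, click) in enumerate(zip(logged_list, clicks)):
            if item == target_list[pos]:
                total += click / propensities[(item, pos)]
    return total / len(logs)


# Toy logging policy: two lists, each shown half of the time.
logging_policy = {("a", "b"): 0.5, ("c", "d"): 0.5}
logs = [
    (("a", "b"), (1, 0)),
    (("c", "d"), (0, 1)),
]
propensities = marginal_propensities(logging_policy)

# The target list ("a", "d") was never logged as a whole, so a list-level
# importance-weighted estimator would have no matching data, yet each of
# its item-position pairs was observed.
estimate = item_position_estimate(logs, ("a", "d"), propensities)
```

The design point is that the number of item-position pairs grows much more slowly than the number of possible lists, which is why exploiting click-model structure can reduce variance when the model's assumption holds.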

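This paper proposes evaluation algorithms that estimate the expected number of clicks on ranked lists from historical logged data, using models of user interaction with the list of items to build more statistically efficient estimators. Experiments show that these estimators outperform prior ones.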

Offline Evaluation of Ranking Policies with Click Models

Human-labeled datasets, along with their corresponding evaluation algorithms,
play an important role in boundary detection. Here we present a psychophysical
experiment that addresses the reliability of such benchmarks. To better
evaluate the performance of boundary detection algorithms, we
propose a computational framework to remove inappropriate human labels and
estimate the intrinsic properties of boundaries.

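This paper presents a psychophysical experiment on the reliability of human-labeled datasets and their corresponding evaluation algorithms in boundary detection, and proposes a computational framework that removes inappropriate human labels and estimates the intrinsic properties of boundaries in order to better evaluate boundary detection algorithms.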