Large language models have demonstrated impressive performance on commonsense
tasks; however, these tasks are often posed as multiple-choice questions,
allowing models to exploit systematic biases. Commonsense is also inherently
probabilistic, with multiple correct answers. For example, the purpose of
"boiling water" could be making tea or cooking, but it could also be killing
germs. Existing tasks do not capture the probabilistic nature of common sense.
To this end, we present commonsense frame completion (CFC), a new generative
task that evaluates common sense via multiple open-ended generations. We also
propose a method of probabilistic evaluation that strongly correlates with human
judgments. Humans drastically outperform strong language model baselines on our
dataset, indicating this approach is both a challenging and useful evaluation
of machine common sense.
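
The abstract does not spell out how the probabilistic evaluation works, so the following is only a minimal illustration of the underlying idea, not the CFC metric itself. It assumes, hypothetically, that human answers for one frame slot are pooled into an empirical distribution, and that a model earns credit equal to the probability mass of the distinct human answers it reproduces. The functions `normalize`, `human_distribution`, and `coverage_score` are hypothetical names. Exact string matching stands in for whatever answer grouping a real evaluation would need, and the annotation counts for "boiling water" are invented for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (a crude stand-in for grouping paraphrases)."""
    return " ".join(answer.lower().split())


def human_distribution(answers: list[str]) -> dict[str, float]:
    """Empirical probability of each distinct (normalized) human answer."""
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def coverage_score(predictions: list[str], human: dict[str, float]) -> float:
    """Hypothetical score: human probability mass covered by the distinct predictions."""
    distinct = {normalize(p) for p in predictions}
    return sum(human.get(p, 0.0) for p in distinct)


# Invented toy annotations for the purpose of "boiling water".
humans = ["make tea", "cooking", "Make tea", "kill germs", "cooking", "make tea"]
dist = human_distribution(humans)  # make tea: 1/2, cooking: 1/3, kill germs: 1/6

print(coverage_score(["make tea"], dist))                            # 0.5
print(coverage_score(["make tea", "kill germs", "melt ice"], dist))  # ≈ 0.667
```

Counting only distinct predictions means that repeating "make tea" earns nothing extra, while adding a less frequent but valid answer such as "kill germs" raises the score; this is one way to reward covering several correct answers rather than only the most common one.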

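Large language models have shown impressive performance on commonsense tasks; however, these tasks are usually posed as multiple-choice questions, which lets models exploit systematic biases. Commonsense is also probabilistic, with multiple correct answers. To this end, we propose a new generative task, commonsense frame completion (CFC), which evaluates commonsense through multiple open-ended generations. We also propose a probabilistic evaluation method that correlates strongly with human judgments. On our dataset, humans far outperform strong language model baselines, indicating that this approach is both challenging and a useful evaluation of machine commonsense.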

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures

The paper concerns the probabilistic evaluation of plans in the presence of
unmeasured variables, each plan consisting of several concurrent or sequential
actions. We establish a graphical criterion for recognizing when the effects of
a given plan can be predicted from passive observations on measured variables
only. When the criterion is satisfied, a closed-form expression is provided for
the probability that the plan will achieve a specified goal.
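
A closed form of this kind typically has the shape of a sequential adjustment (Robins' g-formula). Writing do(·) for an action, a two-step plan with an observed covariate Z measured after the first action X1 and before the second action X2 gives P(y | do(x1), do(x2)) = Σ_z P(z | x1) P(y | x1, z, x2); the exact expression in the paper may be stated differently. The sketch below assumes a toy, fully observed model (X1 → Z, Z → X2, Z → Y, X2 → Y) with invented mechanism probabilities, so the criterion holds trivially. It only shows how the formula is evaluated from passive observations, and contrasts it with naive conditioning, which is biased here because the observed policy for X2 depends on Z. The helpers `bern`, `build_joint`, `marginal`, and `plan_effect` are hypothetical names.

```python
from itertools import product

NAMES = ("x1", "z", "x2", "y")


def bern(p: float, v: int) -> float:
    """Probability that a Bernoulli(p) variable takes value v."""
    return p if v == 1 else 1.0 - p


def build_joint() -> dict[tuple[int, int, int, int], float]:
    """Observational joint P(x1, z, x2, y) from invented toy mechanisms."""
    joint = {}
    for x1, z, x2, y in product((0, 1), repeat=4):
        joint[(x1, z, x2, y)] = (
            bern(0.5, x1)                            # first action, chosen at random
            * bern(0.3 + 0.4 * x1, z)                # covariate responds to X1
            * bern(0.2 + 0.6 * z, x2)                # observed policy for X2 depends on Z
            * bern(0.1 + 0.5 * z + 0.3 * x2, y)      # outcome depends on Z and X2
        )
    return joint


def marginal(joint, **fixed) -> float:
    """Sum the joint over all entries matching the fixed variable values."""
    return sum(p for k, p in joint.items()
               if all(k[NAMES.index(n)] == v for n, v in fixed.items()))


def plan_effect(joint, x1: int, x2: int, y: int = 1) -> float:
    """Sequential adjustment: sum_z P(z | x1) * P(y | x1, z, x2)."""
    total = 0.0
    for z in (0, 1):
        p_z = marginal(joint, x1=x1, z=z) / marginal(joint, x1=x1)
        p_y = (marginal(joint, x1=x1, z=z, x2=x2, y=y)
               / marginal(joint, x1=x1, z=z, x2=x2))
        total += p_z * p_y
    return total


joint = build_joint()
print(plan_effect(joint, x1=1, x2=1))  # ≈ 0.75, effect of the plan do(x1=1), do(x2=1)
print(marginal(joint, x1=1, x2=1, y=1) / marginal(joint, x1=1, x2=1))  # ≈ 0.85, naive
```

The gap between the two numbers is the point: conditioning on X2 = 1 over-represents Z = 1, whereas the adjustment weights Z by how it responds to X1 alone. With hidden variables, a graphical criterion like the one above is what decides whether this formula may still be used.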

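This work studies the probabilistic evaluation of plans consisting of several concurrent or sequential actions in the presence of unmeasured variables, and establishes a graphical criterion for recognizing when the effects of a given plan can be predicted from passive observations of measured variables alone. When the criterion is satisfied, a closed-form expression is provided for the probability that the plan achieves a specified goal.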

Probabilistic Evaluation of Sequential Plans from Causal Models with Hidden Variables

This paper concerns the probabilistic evaluation of the effects of actions in
the presence of unmeasured variables. We show that the identification of the causal
effect of a singleton variable X on a set of variables Y can be
accomplished systematically, in time polynomial in the number of variables in
the graph. When the causal effect is identifiable, a closed-form expression can
be obtained for the probability that the action will achieve a specified goal,
or a set of goals.
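
The paper's own procedure for an arbitrary target set Y is not reproduced here. Instead, this sketch swaps in a simpler, well-known graphical test for the special case where Y is every other observed variable: the effect of a single action X is then identifiable exactly when no bidirected path (a chain of hidden common causes) connects X to one of its children. The check is a breadth-first search over bidirected edges, so it runs in time linear in the size of the graph, illustrating the kind of polynomial-time test the abstract describes. The edge-list encoding and the names `c_component` and `effect_on_all_identifiable` are hypothetical.

```python
from collections import deque


def c_component(node: str, bidirected: list[tuple[str, str]]) -> set[str]:
    """Return all nodes linked to `node` through bidirected (hidden-confounder) edges."""
    adj: dict[str, set[str]] = {}
    for a, b in bidirected:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {node}
    queue = deque([node])
    while queue:  # breadth-first search, linear in the number of edges
        u = queue.popleft()
        for v in adj.get(u, set()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def effect_on_all_identifiable(x: str, directed: list[tuple[str, str]],
                               bidirected: list[tuple[str, str]]) -> bool:
    """True iff no child of x lies in the bidirected component of x."""
    children = {b for a, b in directed if a == x}
    return not (children & c_component(x, bidirected))


# X -> Y with a hidden common cause X <-> Y: not identifiable.
print(effect_on_all_identifiable("X", [("X", "Y")], [("X", "Y")]))  # False
# Front-door pattern X -> M -> Y with X <-> Y: identifiable.
print(effect_on_all_identifiable("X", [("X", "M"), ("M", "Y")], [("X", "Y")]))  # True
```

In the second graph, X's only child M lies outside X's confounded component {X, Y}, so the effect is identifiable even though X and Y share a hidden cause.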

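This paper concerns the probabilistic evaluation of the effects of actions in the presence of unmeasured variables. We show that identifying the causal effect of a singleton variable X on a set of variables Y can be done systematically, in time polynomial in the number of variables in the graph. When the causal effect is identifiable, a closed-form expression can be obtained for the probability that the action achieves a specified goal or a set of goals.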