There is increasing interest in using LLMs as decision-making "agents." Doing
so involves many degrees of freedom: which model should be used; how should it
be prompted; should it be asked to introspect, conduct chain-of-thought
reasoning, etc.? Settling these questions -- and more broadly, determining
whether an LLM agent is reliable enough to be trusted -- requires a methodology
for assessing such an agent's economic rationality. In this paper, we provide
one. We begin by surveying the economic literature on rational decision making,
taxonomizing a large set of fine-grained "elements" that an agent should
exhibit, along with dependencies between them. We then propose a benchmark
distribution that quantitatively scores an LLM's performance on these elements
and, combined with a user-provided rubric, produces a "rationality report
card." Finally, we describe the results of a large-scale empirical experiment
with 14 different LLMs, characterizing both the current state of the art and
the impact of different model sizes on models' ability to exhibit rational
behavior.

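There is growing interest in using LLMs as decision-making "agents," but assessing the economic rationality of such agents remains a key open problem. This paper surveys economic theory, proposes a benchmark distribution, and runs a large-scale empirical experiment to quantitatively evaluate LLM performance, characterizing the current state of the art and the effect of model size on performance.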

Rationality Report Cards: Assessing the Economic Rationality of Large Language Models

We study Off-Policy Evaluation (OPE) in contextual bandit settings with large
action spaces. The benchmark estimators suffer from a severe bias-variance
tradeoff: parametric approaches are biased because the correct model is
difficult to specify, whereas importance-weighting approaches suffer from high
variance. To overcome these limitations, Marginalized Inverse Propensity
Scoring (MIPS) was proposed to reduce the estimator's variance by using action
embeddings. To make the estimator more accurate, we propose a doubly robust
variant of MIPS, the Marginalized Doubly Robust (MDR) estimator. Theoretical
analysis shows that the proposed estimator is unbiased under weaker assumptions
than MIPS while retaining the variance reduction relative to IPS that was the
main advantage of MIPS. Empirical experiments verify the superiority of MDR
over existing estimators.
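
The bias-variance tradeoff described above can be illustrated with the three classical estimators that MIPS and MDR build on. The following is a minimal sketch, not the MDR estimator from the paper: it implements the standard direct method (DM), inverse propensity scoring (IPS), and doubly robust (DR) estimators on a toy bandit with three actions and no action embeddings. The reward function `true_q`, both policies, and all function names are hypothetical and chosen only for illustration.

```python
import random

ACTIONS = [0, 1, 2]  # hypothetical small action space


def true_q(x, a):
    # Hypothetical expected reward for context x in {0, 1} and action a.
    return 0.2 + 0.2 * a + 0.1 * x


def logging_policy(x, a):
    # Uniform logging policy (assumption): probability of a given x.
    return 1.0 / len(ACTIONS)


def target_policy(x, a):
    # Target policy to evaluate (assumption): strongly prefers action 2.
    return 0.8 if a == 2 else 0.1


def simulate(n, seed):
    """Draw n logged (context, action, reward) triples under the logging policy."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randint(0, 1)
        a = rng.choice(ACTIONS)  # uniform draw, consistent with logging_policy
        r = 1.0 if rng.random() < true_q(x, a) else 0.0  # Bernoulli reward
        data.append((x, a, r))
    return data


def dm_estimate(data, pi, q_hat, actions):
    """Direct method: average the reward model under the target policy."""
    total = sum(sum(pi(x, b) * q_hat(x, b) for b in actions) for x, _, _ in data)
    return total / len(data)


def ips_estimate(data, pi, pi0):
    """IPS: reweight each logged reward by the importance weight pi / pi0."""
    return sum(pi(x, a) / pi0(x, a) * r for x, a, r in data) / len(data)


def dr_estimate(data, pi, pi0, q_hat, actions):
    """Doubly robust: reward-model baseline plus an importance-weighted residual."""
    total = 0.0
    for x, a, r in data:
        baseline = sum(pi(x, b) * q_hat(x, b) for b in actions)
        correction = pi(x, a) / pi0(x, a) * (r - q_hat(x, a))
        total += baseline + correction
    return total / len(data)


if __name__ == "__main__":
    logged = simulate(20000, seed=0)

    def bad_model(x, a):
        # Deliberately misspecified reward model: ignores context and action.
        return 0.4

    print("DM  (misspecified model):", dm_estimate(logged, target_policy, bad_model, ACTIONS))
    print("IPS:", ips_estimate(logged, target_policy, logging_policy))
    print("DR  (misspecified model):", dr_estimate(logged, target_policy, logging_policy, bad_model, ACTIONS))
```

In this toy setup the true value of the target policy is 0.59. DM inherits whatever error the reward model has, IPS is unbiased but its weights grow as the logging probabilities shrink (which is what happens in large action spaces), and DR stays unbiased even with the misspecified model because the importance-weighted residual corrects the baseline.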

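We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces. The benchmark estimators struggle with a severe bias-variance tradeoff. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed to reduce the estimator's variance by using action embeddings. To make the estimator more accurate, we propose a doubly robust variant of MIPS, the Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while retaining the variance reduction relative to IPS that is the main advantage of MIPS. Empirical experiments confirm the superiority of MDR over existing estimators.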

Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces

We study video-grounded dialogue generation, where a response is generated
based on the dialogue context and the associated video. The primary challenges
of this task lie in (1) the difficulty of integrating video data into
pre-trained language models (PLMs), which makes it hard to exploit the
power of large-scale pre-training; and (2) the necessity of taking into account
the complementarity of various modalities throughout the reasoning process.
Although existing methods have made remarkable progress in video-grounded
dialogue generation, they still fall short when it comes to integrating with
PLMs in a way that allows information from different modalities to complement
each other.
To alleviate these issues, we first propose extracting pertinent information
from videos and turning it into reasoning paths that are acceptable to PLMs.
Additionally, we propose a multi-agent reinforcement learning method to
collaboratively perform reasoning on different modalities (i.e., video and
dialogue context). Experimental results on two public datasets indicate
that the proposed model outperforms state-of-the-art models by large margins
on both automatic and human evaluations.

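This paper studies video-grounded dialogue generation and proposes a method for integrating video data into pre-trained language models, using multimodal reasoning so that information from different modalities complements each other. Experimental results show that the model significantly outperforms existing state-of-the-art models on both automatic and human evaluations.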