Large Language Models (LLMs) often do not perform well on queries that
require the aggregation of information across texts. To better evaluate this
setting and facilitate modeling efforts, we introduce TACT (Text And
Calculations through Tables), a dataset crafted to evaluate LLMs' reasoning and
computational abilities using complex instructions. TACT contains challenging
instructions that demand stitching information scattered across one or more
texts, and performing complex integration on this information to generate the
answer. We construct this dataset by leveraging an existing dataset of texts
and their associated tables. For each such table, we formulate new queries
and gather their respective answers. We demonstrate that all contemporary LLMs
perform poorly on this dataset, achieving an accuracy below 38%. To pinpoint
the difficulties and thoroughly dissect the problem, we analyze model
performance across three components: table-generation, Pandas
command-generation, and execution. Unexpectedly, we discover that each
component presents substantial challenges for current LLMs. These insights lead
us to propose a focused modeling framework, which we refer to as IE as a tool.
Specifically, we propose to add "tools" for each of the above steps, and
implement each such tool with few-shot prompting. This approach shows an
improvement over existing prompting techniques, offering a promising direction
for enhancing model capabilities in these tasks.
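
The three components named above (table generation, command generation, execution) can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy text, and the query are hypothetical; each "tool" is a hand-written stub standing in for a few-shot-prompted LLM; and a plain Python expression stands in for the Pandas command so that the example needs only the standard library.

```python
import re
from typing import Callable

Row = dict[str, object]
Table = list[Row]

# Hypothetical toy inputs, not taken from the TACT dataset.
TEXT = "Ann scored 12 points. Bob scored 7 points. Ann later scored 5 points."
QUERY = "How many points did Ann score in total?"


def toy_table_tool(text: str, query: str) -> Table:
    """Step 1 (table generation): turn scattered mentions into rows.

    Assumption: a regex stub replaces the few-shot-prompted LLM tool.
    """
    pairs = re.findall(r"(\w+) (?:later )?scored (\d+) points", text)
    return [{"player": name, "points": int(points)} for name, points in pairs]


def toy_command_tool(table: Table, query: str) -> str:
    """Step 2 (command generation): produce an expression over `table`.

    Assumption: a fixed Python expression stands in for a generated
    Pandas command; a real tool would derive it from the query.
    """
    return 'sum(row["points"] for row in table if row["player"] == "Ann")'


def execute(command: str, table: Table) -> object:
    """Step 3 (execution): evaluate the command against the table only."""
    allowed = {"__builtins__": {}, "table": table,
               "sum": sum, "len": len, "max": max, "min": min}
    return eval(command, allowed)


def answer_query(
    text: str,
    query: str,
    table_tool: Callable[[str, str], Table],
    command_tool: Callable[[Table, str], str],
) -> object:
    """Chain the three steps; each tool can be swapped independently."""
    table = table_tool(text, query)
    command = command_tool(table, query)
    return execute(command, table)


print(answer_query(TEXT, QUERY, toy_table_tool, toy_command_tool))  # prints 17
```

Keeping the steps as separate callables mirrors the per-component analysis in the abstract: an error can be attributed to the extracted table, to the generated command, or to its execution.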

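Using the TACT dataset, this work evaluates the reasoning and computational abilities of large language models (LLMs) and finds that existing models perform poorly at integrating scattered information and carrying out complex aggregation. It proposes a new modeling framework called IE as a tool, which adds a tool for each step and implements it with few-shot prompting, effectively improving model capabilities on these tasks.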

TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools

Developmental psychologists have spent decades devising experiments to test
the intelligence and knowledge of infants and children, tracing the origin of
crucial concepts and capacities. Moreover, experimental techniques in
developmental psychology have been carefully designed to discriminate the
cognitive capacities that underlie particular behaviors. We propose that using
classical experiments from child development is a particularly effective way to
probe the computational abilities of AI models in general, and LLMs in
particular. First, the methodological techniques of developmental psychology,
such as the use of novel stimuli to control for past experience or control
conditions to determine whether children are using simple associations, can be
equally helpful for assessing the capacities of LLMs. In parallel, testing LLMs
in this way can tell us whether the information that is encoded in text is
sufficient to enable particular responses, or whether those responses depend on
other kinds of information, such as information from exploration of the
physical world. In this work, we adapt classical developmental experiments to
evaluate the capabilities of LaMDA, a large language model from Google. We
propose a novel LLM Response Score (LRS) metric that can be used to evaluate
other language models, such as GPT. We find that LaMDA generates appropriate
responses that are similar to those of children in experiments involving social
understanding, perhaps providing evidence that knowledge of these domains is
discovered through language. On the other hand, LaMDA's responses in early
object and action understanding, theory of mind, and especially causal
reasoning tasks are very different from those of young children, perhaps
showing that these domains require more real-world, self-initiated exploration
and cannot simply be learned from patterns in language input.

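This work uses classical experiments from developmental psychology to evaluate the capabilities of large language models (LLMs), proposes an LRS metric for assessing those capabilities, and applies the experiments to Google's LaMDA model. LaMDA's responses in social understanding tasks are appropriate and similar to those of children, but its responses in early object and action understanding, theory of mind, and causal reasoning differ greatly from those of children, suggesting that these domains require more self-initiated real-world exploration and cannot be learned simply from patterns in language input.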