We consider the problem of red teaming LLMs on elementary calculations and
algebraic tasks to evaluate how various prompting techniques affect the quality
of outputs. We present a framework to procedurally generate numerical questions
and puzzles, and compare the results with and without the application of
several red teaming techniques. Our findings suggest that even though
structured reasoning and providing worked-out examples slow down the
deterioration of the quality of answers, the gpt-3.5-turbo and gpt-4 models are
not well suited for elementary calculations and reasoning tasks, even when
red teamed.
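As a rough illustration of what procedurally generating numerical questions could look like, the sketch below builds simple arithmetic prompts with known answers and checks a model's reply against them. The abstract does not describe the framework's actual design, so the operator table, the function names `generate_question` and `is_correct`, and the "last integer in the reply" grading rule are all assumptions made for this example.

```python
import operator
import random
import re

# Hypothetical operator set; the paper's real task set is not given in the abstract.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_question(rng, max_operand=99):
    """Return a (question, expected_answer) pair for one random arithmetic task."""
    a = rng.randint(1, max_operand)
    b = rng.randint(1, max_operand)
    symbol = rng.choice(sorted(OPS))  # sorted so the choice is reproducible per seed
    return f"What is {a} {symbol} {b}?", OPS[symbol](a, b)


def is_correct(model_output, expected):
    """Assumed grading rule: the last integer in the reply must equal the answer."""
    numbers = re.findall(r"-?\d+", model_output)
    return bool(numbers) and int(numbers[-1]) == expected


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated set can be regenerated
    for _ in range(3):
        question, answer = generate_question(rng)
        print(question, "->", answer)
```

Because the expected answer is computed alongside the question, the same generator can be rerun with and without a given prompting technique and scored automatically.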

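This work evaluates how different prompting techniques affect answer quality by red teaming LLMs on elementary calculation and algebraic tasks. It finds that although structured reasoning and worked-out examples slow the deterioration of answer quality, the gpt-3.5-turbo and gpt-4 models perform poorly on elementary calculation and reasoning tasks, even when red teamed.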

Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks

Answering numerical questions over hybrid contents from the given tables and
text (TextTableQA) is a challenging task. Recently, Large Language Models (LLMs)
have gained significant attention in the NLP community. With the emergence of
large language models, In-Context Learning and Chain-of-Thought prompting have
become two particularly popular research topics in this field. In this paper,
we introduce a new prompting strategy called Hybrid prompt strategy and
Retrieval of Thought for TextTableQA. Through In-Context Learning, we prompt
the model to develop the ability of retrieval thinking when dealing with hybrid
data. Our method achieves superior performance compared to the fully-supervised
SOTA on the MultiHiertt dataset in the few-shot setting.

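By proposing a hybrid prompt strategy and Retrieval of Thought for TextTableQA, our method guides the model through In-Context Learning and chain-of-thought reasoning and, in the few-shot setting, outperforms the fully-supervised SOTA on the MultiHiertt dataset.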

HRoT: Hybrid prompt strategy and Retrieval of Thought for Table-Text Hybrid Question Answering

Forecasting future world events is a challenging but valuable task. Forecasts
of climate, geopolitical conflict, pandemics and economic indicators help shape
policy and decision making. In these domains, the judgment of expert humans
contributes to the best forecasts. Given advances in language modeling, can
these forecasts be automated? To this end, we introduce Autocast, a dataset
containing thousands of forecasting questions and an accompanying news corpus.
Questions are taken from forecasting tournaments, ensuring high quality,
real-world importance, and diversity. The news corpus is organized by date,
allowing us to precisely simulate the conditions under which humans made past
forecasts (avoiding leakage from the future). Motivated by the difficulty of
forecasting numbers across orders of magnitude (e.g. global cases of COVID-19
in 2022), we also curate IntervalQA, a dataset of numerical questions and
metrics for calibration. We test language models on our forecasting task and
find that performance is far below a human expert baseline. However,
performance improves with increased model size and incorporation of relevant
information from the news corpus. In sum, Autocast poses a novel challenge for
large language models and improved performance could bring large practical
benefits.
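The date-organized corpus is what makes leakage-free evaluation possible: a model answering a question may only see articles that existed at forecast time. A minimal sketch of that filtering step follows, assuming a hypothetical `Article` record and an inclusive cutoff; neither the record layout nor the exact cutoff rule comes from the abstract.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one news item; not the dataset's real schema."""
    published: date
    text: str


def articles_available(corpus, forecast_date):
    """Keep only articles published on or before the forecast date.

    The inclusive cutoff is an assumption; a stricter setup could use `<`.
    """
    return [a for a in corpus if a.published <= forecast_date]


if __name__ == "__main__":
    corpus = [
        Article(date(2021, 12, 30), "Early report"),
        Article(date(2022, 3, 1), "Later report"),
    ]
    visible = articles_available(corpus, date(2022, 1, 1))
    print([a.text for a in visible])
```

Filtering before retrieval, rather than after, keeps future articles out of every later stage such as ranking or prompt construction.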

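This study introduces the Autocast dataset and an accompanying news corpus for testing the forecasting ability of language models, together with IntervalQA, a dataset of numerical questions and calibration metrics. It finds that language model performance is far below a human expert baseline, but improves with larger model size and with the incorporation of relevant information from the news corpus.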