The New York Times Connections game has emerged as a popular and challenging
pursuit for word puzzle enthusiasts. We collect 200 Connections games to
evaluate the performance of state-of-the-art large language models (LLMs)
against expert and novice human players. Our results show that even the
best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning
abilities on a wide variety of benchmarks, can only fully solve 8% of the
games. Both novice and expert human players outperform GPT-4o, with expert
players doing so by a significant margin. To deepen our understanding, we
create a taxonomy of the knowledge types required to
successfully categorize words in the Connections game, revealing that LLMs
struggle with associative, encyclopedic, and linguistic knowledge. Our findings
establish the New York Times Connections game as a challenging benchmark for
evaluating abstract reasoning capabilities in humans and AI systems.

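An evaluation of large language models on the New York Times Connections game reveals their limitations in solving the puzzles, finds that expert human players perform better, and offers a challenging benchmark for assessing the abstract reasoning capabilities of AI systems.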

Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

Reinforcement learning from human feedback (RLHF) is a technique for training
AI systems to align with human goals. RLHF has emerged as the central method
used to finetune state-of-the-art large language models (LLMs). Despite this
popularity, there has been relatively little public work systematizing its
flaws. In this paper, we (1) survey open problems and fundamental limitations
of RLHF and related methods; (2) overview techniques to understand, improve,
and complement RLHF in practice; and (3) propose auditing and disclosure
standards to improve societal oversight of RLHF systems. Our work emphasizes
the limitations of RLHF and highlights the importance of a multi-faceted
approach to the development of safer AI systems.

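Reinforcement learning from human feedback is a technique for training AI systems to align with human goals. This paper surveys its open problems and limitations, overviews techniques for improving and complementing it, and proposes auditing and disclosure standards to improve societal oversight.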