Mathematical reasoning presents a significant challenge for Large Language
Models (LLMs) due to the extensive and precise chain of reasoning required for
accuracy. Ensuring the correctness of each reasoning step is critical. To
address this, we aim to enhance the robustness and factuality of LLMs by
learning from human feedback. However, Direct Preference Optimization (DPO) has
shown limited benefits for long-chain mathematical reasoning, as models
employing DPO struggle to identify detailed errors in incorrect answers. This
limitation stems from a lack of fine-grained process supervision. We propose a
simple, effective, and data-efficient method called Step-DPO, which treats
individual reasoning steps as units for preference optimization rather than
evaluating answers holistically. Additionally, we have developed a data
construction pipeline for Step-DPO, enabling the creation of a high-quality
dataset containing 10K step-wise preference pairs. We also observe that in DPO,
self-generated data is more effective than data generated by humans or GPT-4,
due to the latter's out-of-distribution nature. Our findings demonstrate that
as few as 10K preference data pairs and fewer than 500 Step-DPO training steps
can yield a nearly 3% gain in accuracy on MATH for models with over 70B
parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves
scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively,
surpassing a series of closed-source models, including GPT-4-1106,
Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at
this https URL

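We propose Step-DPO, a simple, effective, and data-efficient method that treats each reasoning step, rather than the answer as a whole, as the unit of preference optimization. While constructing the Step-DPO dataset, we observe that self-generated data is more effective than data generated by humans or GPT-4. Our findings show that as few as 10K preference pairs and fewer than 500 Step-DPO training steps yield a gain of nearly 3% in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO applied to Qwen2-72B-Instruct achieves 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.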

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
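
The step-wise preference idea in the Step-DPO abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the standard DPO log-sigmoid objective is applied to a single reasoning step, conditioned on the problem and the preceding correct steps. The names `StepPreferencePair` and `step_dpo_loss`, the `beta` value, and the log-probability numbers are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class StepPreferencePair:
    """One step-wise preference sample (hypothetical layout, not the released data format)."""
    problem: str
    prior_steps: List[str]  # reasoning steps shared by both continuations
    chosen_step: str        # correct next step
    rejected_step: str      # erroneous next step


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def step_dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature; not taken from the abstract
) -> float:
    """DPO-style loss on one step. Each log-probability is assumed to be
    log p(step | problem, prior_steps) under the policy or the frozen reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


pair = StepPreferencePair(
    problem="Solve 2x + 3 = 11.",
    prior_steps=["Subtract 3 from both sides: 2x = 8."],
    chosen_step="Divide both sides by 2: x = 4.",
    rejected_step="Divide both sides by 2: x = 3.",
)

# Policy identical to the reference: the loss equals log(2).
baseline = step_dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy favors the chosen step more than the reference does: the loss drops.
improved = step_dpo_loss(-3.0, -7.0, -5.0, -5.0)
```

Because only the differing step is scored, the shared prefix in `prior_steps` contributes nothing to the loss, which is the sense in which the step, not the full answer, is the unit of optimization in this sketch.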

Image captioning has long been regarded as a fundamental task in visual
understanding. Recently, however, few large vision-language model (LVLM)
studies have discussed models' image captioning performance, because existing
short-caption benchmarks are outdated and evaluation metrics are unreliable. In
this work, we propose to benchmark the detailed image captioning task by
curating high-quality evaluation datasets annotated by human experts, GPT-4V,
and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric
called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE
information). CAPTURE extracts visual elements (e.g., objects, attributes, and
relations) from captions and then matches these elements through three stages,
achieving the highest consistency with expert judgements among rule-based and
model-based caption metrics. The proposed benchmark and metric provide reliable
evaluation of LVLMs' detailed image captioning ability. Guided by this
evaluation, we further explore how to unleash LVLMs' detailed captioning
capabilities by synthesizing high-quality data through a five-stage data
construction pipeline. Our pipeline uses only a given LVLM itself and other
open-source tools, without any human or GPT-4V annotation in the loop.
Experiments show that the proposed data construction strategy significantly
improves the quality of model-generated detailed caption data for LVLMs with
leading performance, and that data quality can be further improved in a
self-looping paradigm. All code and datasets will be publicly available at
this https URL

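Image captioning has long been regarded as a fundamental task in visual understanding. Recently, because of outdated short-caption benchmarks and unreliable evaluation metrics, few large vision-language model (LVLM) studies have discussed models' image captioning performance. This work proposes a benchmark for the detailed image captioning task, built on high-quality evaluation datasets annotated by human experts, GPT-4V, and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE information). CAPTURE extracts visual elements (e.g., objects, attributes, and relations) from captions and then matches these elements through three stages, achieving the highest consistency with expert judgements among rule-based and model-based caption metrics. The proposed benchmark and metric provide reliable evaluation of LVLMs' detailed image captioning ability. Guided by this evaluation, we further explore how to unleash LVLMs' detailed captioning capabilities through a five-stage data construction pipeline. Our pipeline uses only a given LVLM itself and other open-source tools, without any human or GPT-4V annotation. Experiments show that the proposed data construction strategy significantly improves the quality of detailed caption data generated by LVLMs with leading performance, and that data quality can be further improved in a self-looping paradigm. Code and datasets will be publicly available at this https URL.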

Benchmarking and Improving Detail Image Caption
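
The extract-then-match idea behind CAPTURE, as described in the abstract above, can be sketched in Python. This is an illustrative approximation, not the published metric. The abstract does not name the three stages, so the stages used here (exact match, synonym match, soft similarity match) are an assumption, as are the F1 aggregation, the `SYNONYMS` table, the 0.8 threshold, and the use of `difflib` string similarity as a stand-in for a learned similarity model. Element extraction is skipped: the sketch starts from already-extracted element lists.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Set

# Hypothetical synonym table; a real metric would rely on a lexical resource.
SYNONYMS: Dict[str, Set[str]] = {"couch": {"sofa"}, "sofa": {"couch"}}


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for an embedding-based score."""
    return SequenceMatcher(None, a, b).ratio()


def count_matches(candidate: Sequence[str], reference: Sequence[str],
                  soft_threshold: float = 0.8) -> int:
    """Match candidate elements to reference elements in three assumed stages.
    Each reference element can be matched at most once."""
    remaining: List[str] = list(dict.fromkeys(reference))  # dedupe, keep order
    pending: List[str] = list(dict.fromkeys(candidate))
    matched = 0

    # Stage 1: exact string match.
    leftover: List[str] = []
    for element in pending:
        if element in remaining:
            remaining.remove(element)
            matched += 1
        else:
            leftover.append(element)
    pending = leftover

    # Stage 2: synonym match via the lookup table.
    leftover = []
    for element in pending:
        hit = next((r for r in remaining if r in SYNONYMS.get(element, set())), None)
        if hit is None:
            leftover.append(element)
        else:
            remaining.remove(hit)
            matched += 1
    pending = leftover

    # Stage 3: soft match against the most similar remaining reference element.
    for element in pending:
        best = max(remaining, key=lambda r: similarity(element, r), default=None)
        if best is not None and similarity(element, best) >= soft_threshold:
            remaining.remove(best)
            matched += 1
    return matched


def element_f1(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """F1 over matched elements (an assumed aggregation)."""
    n_cand, n_ref = len(set(candidate)), len(set(reference))
    matched = count_matches(candidate, reference)
    if n_cand == 0 or n_ref == 0 or matched == 0:
        return 0.0
    precision, recall = matched / n_cand, matched / n_ref
    return 2 * precision * recall / (precision + recall)


candidate_objects = ["dog", "couch", "red balls"]
reference_objects = ["dog", "sofa", "red ball", "tree"]
score = element_f1(candidate_objects, reference_objects)
```

In this toy input, "dog" is matched exactly, "couch" is matched to "sofa" through the synonym table, and "red balls" is matched to "red ball" by similarity, while "tree" stays unmatched and lowers recall. The same routine could be run separately on objects, attributes, and relations.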

Text-to-image generation has recently made remarkable progress in
synthesizing realistic human photos conditioned on given text prompts. However,
existing personalized generation methods cannot simultaneously satisfy the
requirements of high efficiency, promising identity (ID) fidelity, and flexible
text controllability. In this work, we introduce PhotoMaker, an efficient
personalized text-to-image generation method that primarily encodes an arbitrary
number of input ID images into a stacked ID embedding to preserve ID
information. Such an embedding, serving as a unified ID representation, can not
only encapsulate the characteristics of the same input ID comprehensively, but
also accommodate the characteristics of different IDs for subsequent
integration. This paves the way for more intriguing and practically valuable
applications. In addition, to drive the training of PhotoMaker, we propose an
ID-oriented data construction pipeline to assemble the training data. Trained
on the dataset constructed through the proposed pipeline, PhotoMaker
demonstrates better ID preservation than test-time fine-tuning based methods,
while also providing significant speed improvements, high-quality generation
results, strong generalization capabilities, and a wide range of applications.
Our project page is available at
this https URL

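This work advances text-to-image generation by proposing PhotoMaker, a method that aims to satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability at the same time. It preserves identity information by encoding the input ID images into a stacked ID embedding. This embedding can not only comprehensively capture the characteristics of the same input ID but also accommodate the characteristics of different IDs for integration, enabling more intriguing and practically valuable applications.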
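
The notion of encoding an arbitrary number of ID images into one stacked embedding can be illustrated with a small Python sketch. It is a toy, not PhotoMaker's architecture: it assumes "stacking" means placing one embedding per input image along a sequence axis, and `toy_encoder` is a hypothetical placeholder for a real image encoder. Images are represented as flat lists of pixel intensities purely for illustration.

```python
from typing import Callable, List, Sequence

Embedding = List[float]


def toy_encoder(image: Sequence[float], dim: int = 4) -> Embedding:
    """Hypothetical stand-in for an image encoder: scales the mean intensity.
    A real system would use a learned vision model instead."""
    mean = sum(image) / len(image)
    return [mean * (i + 1) for i in range(dim)]


def stack_id_embeddings(
    images: Sequence[Sequence[float]],
    encoder: Callable[[Sequence[float]], Embedding] = toy_encoder,
) -> List[Embedding]:
    """Encode each ID image and stack the results along a sequence axis.
    The output has one row per input image, so any number of images is accepted."""
    if not images:
        raise ValueError("at least one ID image is required")
    return [encoder(image) for image in images]


# Three photos of the same (toy) identity.
same_id = stack_id_embeddings([[0.1, 0.2], [0.2, 0.3], [0.3, 0.4]])
# Photos of two different identities stacked together for later integration.
mixed_ids = stack_id_embeddings([[0.1, 0.2], [0.9, 0.8]])
```

Because the stack length follows the number of input images, the same interface covers both cases mentioned in the abstract: several images of one ID, and images of different IDs combined into a single representation.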