Human alignment in large language models (LLMs) is an active area of
research. A recent groundbreaking work, direct preference optimization (DPO),
has greatly simplified the pipeline of earlier reinforcement learning
from human feedback (RLHF) methods by bypassing their reward learning stage. DPO,
after training, provides an implicit reward model. In this work, we make a
novel observation that this implicit reward model can by itself be used in a
bootstrapping fashion to further align the LLM. Our approach uses the
implicit rewards from the current LLM to construct a preference dataset, which is
then used in subsequent DPO rounds. We incorporate refinements that debias the
length of the responses and improve the quality of the preference dataset to
further improve our approach. Our approach, named self-alignment with DPO
ImpliCit rEwards (DICE), substantially improves alignment and outperforms
Gemini Pro on AlpacaEval 2, reaching a 27.55% length-controlled win rate
against GPT-4 Turbo with only 8B parameters and no external feedback. Our code
is available at this https URL
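
The bootstrapping idea can be illustrated with a minimal sketch. It relies on the standard DPO result that the implicit reward is, up to a prompt-dependent constant, beta times the log-ratio of policy to reference likelihoods. The `Candidate` class, the function names, and in particular the word-count penalty `alpha` are assumptions made for illustration; the abstract does not specify how DICE debiases length, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    logp_policy: float  # summed token log-probs under the current DPO-tuned policy
    logp_ref: float     # summed token log-probs under the reference model


def implicit_reward(c: Candidate, beta: float = 0.1) -> float:
    # DPO implicit reward, up to a prompt-only constant:
    # r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))
    return beta * (c.logp_policy - c.logp_ref)


def length_adjusted_reward(c: Candidate, beta: float = 0.1, alpha: float = 0.0) -> float:
    # Hypothetical length debiasing: subtract alpha per whitespace-separated word.
    return implicit_reward(c, beta) - alpha * len(c.text.split())


def build_preference_pair(
    candidates: List[Candidate], beta: float = 0.1, alpha: float = 0.0
) -> Tuple[str, str]:
    # Rank sampled responses for one prompt; the best becomes "chosen",
    # the worst becomes "rejected" in the next DPO round's dataset.
    ranked = sorted(candidates, key=lambda c: length_adjusted_reward(c, beta, alpha))
    return ranked[-1].text, ranked[0].text
```

Repeating this for every prompt yields a new preference dataset without external feedback, which is the property the abstract emphasizes. With `alpha = 0` the ranking uses the raw implicit reward; a positive `alpha` shifts preference toward shorter responses.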

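Using the implicit reward model of direct preference optimization (DPO), we propose a self-alignment method, named self-alignment with DPO ImpliCit rEwards (DICE), to improve the alignment performance and quality of large language models.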

Bootstrapping Language Models with DPO Implicit Rewards

With the rise of large language models (LLMs), ensuring they embody the
principles of being helpful, honest, and harmless (3H), known as Human
Alignment, becomes crucial. While existing alignment methods such as RLHF and DPO
effectively fine-tune LLMs to match the preferences in a preference
dataset, they often make LLMs highly receptive to human input and external
evidence, even when this information is poisoned. As a result, LLMs tend to
behave as Adaptive Chameleons when external evidence conflicts with their
parametric memory. This exacerbates the risk of LLMs being attacked with external
poisoned data, which poses a significant security risk to LLM system
applications such as retrieval-augmented generation (RAG). To address this
challenge, we propose a novel framework: Dialectical Alignment (DA), which (1)
utilizes AI feedback to identify optimal strategies for LLMs to navigate
inter-context conflicts and context-memory conflicts with different external
evidence in the context window (i.e., different ratios of poisoned factual
contexts); (2) constructs an SFT dataset and a preference dataset
based on the AI feedback and strategies above; (3) uses these datasets for
LLM alignment to defend against poisoned context attacks while preserving the
effectiveness of in-context knowledge editing. Our experiments show that the
dialectical alignment model improves poisoned data attack defense by 20 and
does not require any additional prompt engineering or prior declaration of
"you may be attacked" in the LLMs' context window.
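
To make the "different ratios of poisoned factual contexts" concrete, the following is a minimal sketch of how a context window with a controlled share of poisoned passages might be assembled for such experiments. The function name `build_context`, its parameters, and the sampling scheme are hypothetical; the abstract does not describe the authors' actual construction.

```python
import random
from typing import List


def build_context(
    factual: List[str],
    poisoned: List[str],
    poison_ratio: float,
    k: int,
    seed: int = 0,
) -> List[str]:
    """Return k passages, of which round(k * poison_ratio) are poisoned."""
    if not 0.0 <= poison_ratio <= 1.0:
        raise ValueError("poison_ratio must be in [0, 1]")
    n_poisoned = round(k * poison_ratio)
    n_factual = k - n_poisoned
    if n_poisoned > len(poisoned) or n_factual > len(factual):
        raise ValueError("not enough passages to satisfy the requested mix")
    rng = random.Random(seed)  # seeded so that a given mix is reproducible
    passages = rng.sample(poisoned, n_poisoned) + rng.sample(factual, n_factual)
    rng.shuffle(passages)  # avoid a fixed position for the poisoned passages
    return passages
```

Sweeping `poison_ratio` from 0 to 1 would produce the range of inter-context conflict conditions under which a model's responses could be collected and judged.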

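Using AI feedback, we propose a novel scheme, the Dialectical Alignment model, which adjusts the internal state of large language models when different pieces of external evidence conflict, in order to resist poisoned data attacks and improve system security.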

Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs

Video generation has witnessed significant advancements, yet evaluating these
models remains a challenge. A comprehensive evaluation benchmark for video
generation is indispensable for two reasons: 1) Existing metrics do not fully
align with human perceptions; 2) An ideal evaluation system should provide
insights to inform future developments of video generation. To this end, we
present VBench, a comprehensive benchmark suite that dissects "video generation
quality" into specific, hierarchical, and disentangled dimensions, each with
tailored prompts and evaluation methods. VBench has three appealing properties:
1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation
(e.g., subject identity inconsistency, motion smoothness, temporal flickering,
and spatial relationship). The evaluation metrics with fine-grained levels
reveal individual models' strengths and weaknesses. 2) Human Alignment: We also
provide a dataset of human preference annotations to validate our benchmark's
alignment with human perception for each evaluation dimension. 3)
Valuable Insights: We examine current models' abilities across various
evaluation dimensions and content types. We also investigate the gaps
between video and image generation models. We will open-source VBench,
including all prompts, evaluation methods, generated videos, and human
preference annotations, and also include more video generation models in VBench
to drive forward the field of video generation.
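
One simple way to check whether an automatic metric agrees with human preference annotations on a single dimension is a pairwise agreement rate, sketched below. This is a simplified stand-in chosen for illustration, not the validation procedure used by VBench; the function name and the data layout are assumptions.

```python
from typing import Dict, Tuple


def pairwise_agreement(
    metric_scores: Dict[str, float],
    human_prefs: Dict[Tuple[str, str], str],
) -> float:
    """Fraction of annotated model pairs where the metric picks the human-preferred model.

    metric_scores maps a model name to its score on one dimension (higher is better).
    human_prefs maps a pair (model_a, model_b) to the name of the preferred model.
    """
    agree = 0
    total = 0
    for (a, b), winner in human_prefs.items():
        if metric_scores[a] == metric_scores[b]:
            continue  # the metric expresses no preference; skip the pair
        metric_winner = a if metric_scores[a] > metric_scores[b] else b
        total += 1
        agree += int(metric_winner == winner)
    return agree / total if total else float("nan")
```

Computing this separately for each dimension mirrors the per-dimension validation the abstract describes: a dimension whose metric agrees poorly with annotators would be a candidate for revision.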

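With VBench, we provide a comprehensive evaluation benchmark for video generation that decomposes video generation quality into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. We also provide a dataset of human preference annotations that validates the benchmark's alignment with human perception. Across the evaluation dimensions and content types, we study how current models differ in video generation ability and examine the gaps between video and image generation models.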

VBench: Comprehensive Benchmark Suite for Video Generative Models

The recent advancement of large language models (LLMs) has been achieved
through a combination of instruction tuning and human alignment. However, building
manually crafted instruction datasets and performing human alignment have become the
bottleneck for scaling the development of LLMs. In this paper, we explore the
idea of leveraging AI models in lieu of humans as the teacher to train student
LLMs. Our method is inspired by how human students refine their writing skills
by following the rubrics and learning from the revisions offered by their
tutors. Specifically, we employ a teacher LLM to create a curriculum for
instruction tuning of the student LLM, namely Curriculum Instruction TunING
(CITING). It encompasses two main steps: (1) the teacher LLM crafts the rubrics
for evaluating the answers corresponding to various types of questions, and (2)
the student LLM learns to follow the rubrics and perform self-correction from
the revisions made by the teacher. We carry out these two steps iteratively to
form the full CITING procedure. We compare CITING to a series of state-of-the-art
baselines on four datasets. Our method demonstrates strong improvements in terms
of articulateness, depth, and comprehensiveness under GPT-4 evaluation. Specifically,
it achieves average win rates of 79.4% over SFT, 73.4% over RLHF, 78.1%
over RRHF, and 76.3% over RAFT.
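
The two steps can be sketched as a single data-collection round. The callables `make_rubric`, `student_answer`, and `teacher_revise` are hypothetical stand-ins for calls to the teacher and student LLMs, and the record layout is an assumption; the abstract does not give the authors' interfaces or training details.

```python
from typing import Callable, Dict, List


def citing_round(
    questions: List[str],
    make_rubric: Callable[[str], str],
    student_answer: Callable[[str, str], str],
    teacher_revise: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Collect one round of (question, rubric, draft, revision) training records."""
    records = []
    for question in questions:
        rubric = make_rubric(question)                       # step 1: teacher writes the rubric
        draft = student_answer(question, rubric)             # student answers under the rubric
        revision = teacher_revise(question, rubric, draft)   # step 2: teacher revises the draft
        records.append(
            {"question": question, "rubric": rubric, "draft": draft, "revision": revision}
        )
    return records
```

In an iterative setup, the student would be fine-tuned on these records to map a draft and rubric to the revision, and the updated student would then produce the drafts for the next round.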

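Using AI models in place of humans as the teacher, and having the student learn from the teacher's revisions of its answers, the authors construct the Curriculum Instruction TunING (CITING) method, which improves the articulateness, depth, and comprehensiveness of large language models and achieves a 79.4% win rate over SFT under GPT-4 evaluation.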