We present Preference Flow Matching (PFM), a new framework for
preference-based reinforcement learning (PbRL) that streamlines the integration
of preferences into an arbitrary class of pre-trained models. Existing PbRL
methods require fine-tuning pre-trained models, which presents challenges such
as limited scalability, inefficiency, and the need for model modifications,
especially with black-box APIs like GPT-4. In contrast, PFM uses flow matching
techniques to directly learn from preference data, thereby reducing the
dependency on extensive fine-tuning of pre-trained models. By leveraging
flow-based models, PFM transforms less preferred data into preferred outcomes,
and effectively aligns model outputs with human preferences without relying on
explicit or implicit reward function estimation, thus avoiding common issues
like overfitting in reward models. We provide theoretical insights that support
our method's alignment with standard PbRL objectives. Experimental results
indicate the practical effectiveness of our method, offering a new direction in
aligning pre-trained models with preferences.

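Preference Flow Matching (PFM) is a new preference-based reinforcement learning (PbRL) framework that uses flow matching to learn directly from preference data, reducing the dependence on extensive fine-tuning of pre-trained models, effectively aligning model outputs with human preferences, and avoiding common issues such as reward-model overfitting.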
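
As a rough illustration of the idea summarized above (learning a flow that moves less preferred samples toward preferred ones, with no reward model), the following is a minimal sketch of generic flow matching on a toy 1-D problem. It is not the authors' implementation. The toy data (preferred sample = less preferred sample + 2), the straight-line interpolation path, the linear vector field v(x, t) = w0 + w1*x + w2*t, and the helper names `fit_linear_velocity`, `solve_linear_system`, and `transport` are all assumptions made for illustration.

```python
import random


def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit_linear_velocity(pairs, n_time_samples=8, seed=0):
    """Least-squares fit of v(x, t) = w0 + w1*x + w2*t to flow matching targets.

    Each pair is (less_preferred, preferred). The regression target at a point
    on the straight path between them is the path velocity, preferred - less_preferred.
    """
    rng = random.Random(seed)
    A = [[0.0] * 3 for _ in range(3)]  # normal equations: A w = b
    b = [0.0] * 3
    for y_minus, y_plus in pairs:
        for _ in range(n_time_samples):
            t = rng.random()
            x_t = (1.0 - t) * y_minus + t * y_plus  # point on the straight path
            target = y_plus - y_minus               # velocity of that path
            phi = (1.0, x_t, t)                     # features of the linear field
            for i in range(3):
                b[i] += phi[i] * target
                for j in range(3):
                    A[i][j] += phi[i] * phi[j]
    return solve_linear_system(A, b)


def transport(y, w, n_steps=20):
    """Euler-integrate dx/dt = v(x, t) from t = 0 to t = 1, starting at y."""
    x, dt = y, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x += dt * (w[0] + w[1] * x + w[2] * t)
    return x


# Toy preference data (assumed): each preferred sample is the less preferred one shifted by +2.
data_rng = random.Random(1)
less_preferred = [data_rng.gauss(-1.0, 0.3) for _ in range(200)]
pairs = [(y, y + 2.0) for y in less_preferred]

w = fit_linear_velocity(pairs)
improved = [transport(y, w) for y in less_preferred]
```

In this sketch the samples fed to `transport` stand in for outputs of a frozen pre-trained model: only the small vector field is fitted, and the model producing the samples is never modified. A practical system would presumably replace the linear field with a neural network and condition it on the input context.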

Preference Alignment with Flow Matching

Pretrained Foundation Models (PFMs) are regarded as the foundation for
various downstream tasks with different data modalities. A pretrained
foundation model, such as BERT, GPT-3, MAE, DALL-E, and ChatGPT, is trained on
large-scale data, which provides a reasonable parameter initialization for a
wide range of downstream applications. The idea of pretraining behind PFMs
plays an important role in the application of large models. Unlike
previous methods that apply convolutional and recurrent modules for feature
extraction, the generative pre-training (GPT) method uses the Transformer as
the feature extractor and is trained on large datasets with an autoregressive
paradigm. Similarly, BERT applies Transformers to train on large datasets as
a contextual language model. Recently, ChatGPT has shown promising success
among large language models; it applies an autoregressive language model with
zero-shot or few-shot prompting. With the extraordinary success of PFMs, AI has
made waves in a variety of fields over the past few years. Numerous methods,
datasets, and evaluation metrics have been proposed in the literature, and the
need for an updated survey is rising. This study provides a comprehensive review of
recent research advancements, current and future challenges, and opportunities
for PFMs in text, image, graph, as well as other data modalities. We first
review the basic components and existing pretraining in natural language
processing, computer vision, and graph learning. We then discuss other advanced
PFMs for other data modalities and unified PFMs considering the data quality
and quantity. In addition, we discuss relevant research on the fundamentals of
PFMs, including model efficiency and compression, security, and privacy.
Finally, we lay out key implications, future research directions, challenges,
and open problems.

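This study surveys recent research progress on pretrained foundation model techniques, focusing on their application prospects, challenges, and opportunities in text, image, graph, and other data modalities, and also discusses their basic components, existing pretraining methods, and future trends.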