BriefGPT.xyz
Aug, 2023
Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length
Miao Fan, Chen Hu, Shuchang Zhou
TL;DR
This paper introduces a new task: manipulating the output tokenizer length of a model's generations by using a reward model together with Proximal Policy Optimization (PPO). Experiments confirm PPO's effectiveness, and its potential for further development, in both manipulating output tokenizer length and in training.
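The task described above amounts to defining a reward signal that scores a generated sequence by how close its token count is to a target. As a minimal sketch (not the paper's actual reward; the function name, form, and parameters here are illustrative assumptions), such a reward could be:

```python
def length_reward(num_tokens: int, target_len: int) -> float:
    """Hypothetical length-targeting reward: negative absolute distance
    between the output's token count and the desired target length.
    Maximal (0.0) when the generation hits the target exactly."""
    return -abs(num_tokens - target_len)


# Example: scoring candidate generations against a target of 10 tokens.
candidates = [4, 10, 17]
scores = [length_reward(n, target_len=10) for n in candidates]
```

A PPO trainer would then optimize the language model's policy to maximize this reward, pushing generations toward the target length without any supervised length labels.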
Abstract
Reinforcement learning from human feedback (RLHF) plays a pivotal role in shaping the impact of large language models (LLMs), contributing significantly to controlling …