The diversity of contexts in which large language models (LLMs) are deployed
requires the ability to modify or customize default model behaviors to
incorporate nuanced requirements and preferences. A convenient interface to
specify such model adjustments is high-level verbal feedback, such as "Don't
use emojis when drafting emails to my boss." However, while writing high-level
feedback is far simpler than collecting annotations for reinforcement learning
from human feedback (RLHF), we find that simply prompting a model with such
feedback leads to overgeneralization of the feedback to contexts where it is
not relevant. We study the problem of incorporating verbal feedback without
such overgeneralization, inspiring a new method Contextualized Critiques with
Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level
feedback to generate a small synthetic preference dataset specifying how the
feedback should (and should not) be applied. It then fine-tunes the model in
accordance with the synthetic preference data while minimizing the divergence
from the original model for prompts where the feedback does not apply. Our
experimental results indicate that our approach effectively applies verbal
feedback to relevant scenarios while preserving existing behaviors for other
contexts. For both human- and GPT-4-generated high-level feedback, C3PO
effectively adheres to the given feedback comparably to in-context baselines
while reducing overgeneralization by 30%.

在大语言模型中引入高级口头反馈以传达特定要求和偏好的能力是重要的，本文提出了一种名为 C3PO 的方法，通过生成小规模合成偏好数据集并最小化与原始模型的差异来有效地应用口头反馈，同时减少了过度泛化。

RLVF：无泛化的口头反馈学习

RLVF: Learning from Verbal Feedback without Overgeneralization

The ability to learn and refine behavior after deployment has become ever
more important for robots as we design them to operate in unstructured
environments like households. In this work, we design a new learning system
based on large language model (LLM), OLAF, that allows everyday users to teach
a robot using verbal corrections when the robot makes mistakes, e.g., by saying
"Stop what you're doing. You should move closer to the cup." A key feature of
OLAF is its ability to update the robot's visuomotor neural policy based on the
verbal feedback to avoid repeating mistakes in the future. This is in contrast
to existing LLM-based robotic systems, which only follow verbal commands or
corrections but not learn from them. We demonstrate the efficacy of our design
in experiments where a user teaches a robot to perform long-horizon
manipulation tasks both in simulation and on physical hardware, achieving on
average 20.0% improvement in policy success rate. Videos and more results are
at this https URL

我们设计了一种基于大型语言模型 (LLM) 的学习系统 OLAF，使得普通用户可以通过语音纠正教导机器人，从而更新机器人的视觉运动神经策略，以避免未来重复错误，并在实验中展示了在长期任务执行中的成功率平均提高了 20.0%。