Learning control policies to perform complex robotics tasks from human
preference data presents significant challenges. On the one hand, the
complexity of such tasks typically requires learning policies to perform a
variety of subtasks, then combining them to achieve the overall goal. At the
same time, comprehensive, well-engineered reward functions are typically
unavailable in such problems, while limited human preference data often is;
making efficient use of such data to guide learning is therefore essential.
Methods for learning to perform complex robotics tasks from human preference
data must overcome both these challenges simultaneously. In this work, we
introduce DIPPER: Direct Preference Optimization to Accelerate
Primitive-Enabled Hierarchical Reinforcement Learning, an efficient
hierarchical approach that leverages direct preference optimization to learn a
higher-level policy and reinforcement learning to learn a lower-level policy.
DIPPER enjoys improved computational efficiency due to its use of direct
preference optimization instead of standard preference-based approaches such as
reinforcement learning from human feedback, while it also mitigates the
well-known hierarchical reinforcement learning issues of non-stationarity and
infeasible subgoal generation due to our use of primitive-informed
regularization inspired by a novel bi-level optimization formulation of the
hierarchical reinforcement learning problem. To validate our approach, we
perform extensive experimental analysis on a variety of challenging robotics
tasks, demonstrating that DIPPER outperforms hierarchical and non-hierarchical
baselines, while ameliorating the non-stationarity and infeasible subgoal
generation issues of hierarchical reinforcement learning.
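
For reference, the generic direct preference optimization objective that the higher-level policy learning builds on can be written as below. This is the standard DPO loss, not DIPPER's own hierarchical objective, which additionally uses primitive-informed regularization that is not shown here. The symbols are assumptions for illustration: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a fixed reference policy, $(x, y_w, y_l)$ a context with a preferred and a dispreferred output drawn from the preference dataset $\mathcal{D}$, $\sigma$ the logistic function, and $\beta$ a temperature.

```latex
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
\]
```

Because the loss depends only on log-probability ratios, no separate reward model has to be fitted, which is the usual source of DPO's efficiency advantage over reinforcement learning from human feedback.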

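DIPPER is an efficient hierarchical approach that combines direct preference optimization and reinforcement learning: it learns a higher-level policy from human preference data and a lower-level policy with reinforcement learning, addressing the challenges of learning complex robotics tasks from human preference data.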

DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning

This paper presents an exploration of preference learning in text-to-motion
generation. We find that current improvements in text-to-motion generation
still rely on datasets requiring expert labelers with motion capture systems.
In contrast, learning from human preference data does not require motion capture
systems; a labeler with no expertise simply compares two generated motions.
This is particularly efficient because evaluating the model's output is easier
than gathering the motion that performs a desired task (e.g., a backflip). To
pioneer the exploration of this paradigm, we annotate 3,528 preference pairs
generated by MotionGPT, marking the first effort to investigate various
algorithms for learning from preference data. In particular, our exploration
highlights important design choices when using preference data. Additionally,
our experimental results show that preference learning has the potential to
greatly improve current text-to-motion generative models. Our code and dataset
are publicly available at
https://github.com/THU-LYJ-Lab/InstructMotion
to further facilitate research in this area.

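This paper explores preference learning in text-to-motion generation. It notes that current text-to-motion generation still relies on datasets that require expert labelers with motion capture systems, whereas learning from human preference data needs no motion capture system: a labeler with no expertise simply compares two generated motions. The authors annotate 3,528 preference pairs generated by MotionGPT, marking a first effort to investigate algorithms for learning from preference data, and highlight important design choices when using such data. Experimental results show that preference learning has the potential to greatly improve current text-to-motion generative models. The code and dataset are publicly available at https://github.com/THU-LYJ-Lab/InstructMotion to further facilitate research in this area.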

Exploring Text-to-Motion Generation with Human Preference

Modern instruction-tuned models have become highly capable in text generation
tasks such as summarization, and are expected to be released at a steady pace.
In practice, one may now wish to choose, confidently but with minimal effort,
the best-performing summarization model for a new domain or
purpose. In this work, we empirically investigate the test sample size
necessary to select a preferred model in the context of news summarization.
Empirical results reveal that comparative evaluation converges quickly for both
automatic and human evaluation, with clear preferences for a system emerging
from under 100 examples. The human preference data allows us to quantify how
well automatic scores can reproduce preference rankings across a variety of
downstream summarization tasks. We find that, while automatic metrics are
stable at smaller sample sizes, only some automatic metrics are able to
moderately predict model win rates according to human preference.
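
One way to make "a clear preference emerging from under 100 examples" concrete is to put a confidence interval around the pairwise win rate and check whether it excludes 0.5. The sketch below does this with a Wilson score interval. This is an illustrative choice, not the procedure used in the paper, and the function names (`wilson_interval`, `clear_preference`, `smallest_decisive_prefix`) are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate of `wins` out of `n` comparisons."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def clear_preference(wins: int, n: int, z: float = 1.96) -> bool:
    """True if the interval excludes 0.5, i.e. one system is reliably preferred."""
    low, high = wilson_interval(wins, n, z)
    return low > 0.5 or high < 0.5


def smallest_decisive_prefix(outcomes: List[int], z: float = 1.96) -> Optional[int]:
    """Smallest prefix length at which the preference is clear, else None.

    `outcomes` holds 1 where system A was preferred and 0 where system B was.
    """
    wins = 0
    for i, won in enumerate(outcomes, start=1):
        wins += won
        if clear_preference(wins, i, z):
            return i
    return None


if __name__ == "__main__":
    print(wilson_interval(70, 100))   # interval for a 70% win rate on 100 pairs
    print(clear_preference(55, 100))  # a 55% win rate on 100 pairs is not decisive
```

Checking every prefix, as `smallest_decisive_prefix` does, amounts to repeated testing and will declare a winner too early more often than the nominal 95% level suggests, so it is best read as a rough indication of how quickly a comparison stabilizes rather than as a stopping rule.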

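In this study, we empirically investigate the test sample size needed to select the best-performing model for news summarization. We find that comparative evaluation converges with fewer than 100 examples, and that human preference data makes it possible to quantify how well automatic scores reproduce preference rankings across a variety of downstream summarization tasks.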