Multi-objective reinforcement learning (MORL) is increasingly relevant due to
its resemblance to real-world scenarios requiring trade-offs between multiple
objectives. Because MORL must cater to diverse user preferences, the challenges
of traditional reinforcement learning are amplified. To address the difficulty of
training policies from scratch in MORL, we introduce demonstration-guided
multi-objective reinforcement learning (DG-MORL). This novel approach utilizes
prior demonstrations, aligns them with user preferences via corner weight
support, and incorporates a self-evolving mechanism to refine suboptimal
demonstrations. Our empirical studies demonstrate DG-MORL's superiority over
existing MORL algorithms, establishing its robustness and efficacy,
particularly under challenging conditions. We also provide an upper bound on
the algorithm's sample complexity.
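
The idea of aligning demonstrations with user preferences can be illustrated
with a minimal sketch. It assumes linear scalarization (a weighted sum of the
vector-valued return) and simply picks, for each preference weight vector, the
demonstration with the highest scalarized return. The function names, the demo
names, and the return values are hypothetical and not taken from the paper; the
computation of corner weights and the self-evolving refinement are not shown.

```python
from typing import Dict, List, Sequence, Tuple


def scalarize(returns: Sequence[float], weights: Sequence[float]) -> float:
    """Linear scalarization: weighted sum of a vector-valued return."""
    if len(returns) != len(weights):
        raise ValueError("returns and weights must have the same length")
    return sum(r * w for r, w in zip(returns, weights))


def best_demo_per_weight(
    demo_returns: Dict[str, Sequence[float]],
    weight_vectors: List[Tuple[float, ...]],
) -> Dict[Tuple[float, ...], str]:
    """For each preference weight vector, pick the demonstration whose
    scalarized return is highest (an assumed selection rule)."""
    return {
        w: max(demo_returns, key=lambda name: scalarize(demo_returns[name], w))
        for w in weight_vectors
    }


# Hypothetical two-objective returns for three demonstrations.
demos = {"fast": (10.0, 2.0), "safe": (3.0, 9.0), "balanced": (7.0, 7.0)}
# Hypothetical preference weights; in DG-MORL these would be corner weights.
weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]

guide = best_demo_per_weight(demos, weights)
for w, name in guide.items():
    print(w, "->", name)
```

Under these made-up numbers, each preference is matched to a different
demonstration, which is the situation in which a single guide policy would not
suffice.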

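Building on prior demonstrations, corner weight support, and a self-evolving
mechanism, we introduce a new method, demonstration-guided multi-objective
reinforcement learning (DG-MORL), to address the difficulty of training
policies from scratch in multi-objective reinforcement learning. Experiments
demonstrate DG-MORL's superiority, robustness, and effectiveness under
challenging conditions, and we provide an upper bound on the algorithm's
sample complexity.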

Demonstration Guided Multi-Objective Reinforcement Learning

Enhancing the instruction-following ability of Large Language Models (LLMs)
primarily demands substantial instruction-tuning datasets. However, the sheer
volume of these imposes a considerable computational burden and annotation
cost. To investigate a label-efficient instruction tuning method that allows
the model itself to actively sample subsets that are equally or even more
effective, we introduce a self-evolving mechanism, DiverseEvol. In this process,
a model iteratively augments its training subset to refine its own performance,
without requiring any intervention from humans or more advanced LLMs. The key
to our data sampling technique lies in the enhancement of diversity in the
chosen subsets, as the model selects new data points most distinct from any
existing ones according to its current embedding space. Extensive experiments
across three datasets and benchmarks demonstrate the effectiveness of
DiverseEvol. Our models, trained on less than 8% of the original dataset,
maintain or improve performance compared with fine-tuning on the full data. We also
provide empirical evidence to analyze the importance of diversity in
instruction data and the iterative scheme as opposed to one-time sampling. Our
code is publicly available at this https URL
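
The selection step described above, choosing new data points most distinct from
the existing ones in the current embedding space, can be sketched as a greedy
farthest-point (k-center style) rule. This is a minimal sketch under stated
assumptions: Euclidean distance, embeddings given as plain tuples, and a
hypothetical function name `select_diverse`. The abstract does not fix these
details, and the retraining and re-embedding between rounds are not shown.

```python
import math
from typing import List, Sequence


def select_diverse(
    embeddings: Sequence[Sequence[float]],
    selected: Sequence[int],
    k: int,
) -> List[int]:
    """Greedily pick up to k new indices, each time taking the point whose
    distance to its nearest already-chosen point is largest."""
    if not selected:
        raise ValueError("need at least one already-selected point")
    chosen = set(selected)
    # min_dist[i] = distance from point i to its nearest chosen point
    min_dist = [
        min(math.dist(e, embeddings[j]) for j in chosen) for e in embeddings
    ]
    new_indices: List[int] = []
    for _ in range(k):
        candidates = [i for i in range(len(embeddings)) if i not in chosen]
        if not candidates:
            break  # pool exhausted
        idx = max(candidates, key=lambda i: min_dist[i])
        new_indices.append(idx)
        chosen.add(idx)
        # Update nearest-chosen distances with the newly added point.
        min_dist = [
            min(d, math.dist(e, embeddings[idx]))
            for d, e in zip(min_dist, embeddings)
        ]
    return new_indices


# Hypothetical 2-D embeddings: two tight pairs and one outlier.
pool = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
picked = select_diverse(pool, selected=[0], k=2)
print(picked)
```

In an iterative scheme, this step would be repeated after each round of
training, with the embeddings recomputed by the updated model, so that
"distinct" is always judged in the model's current embedding space.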

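By introducing a self-evolving mechanism, DiverseEvol, we propose a
label-efficient instruction-tuning method that lets the model itself actively
sample equally or more effective subsets to improve its own performance,
without intervention from humans or more advanced LLMs. The key to our data
sampling technique is enhancing the diversity of the chosen subsets: the model
selects new data points that are most distinct from any existing ones according
to its current embedding space. Extensive experiments across three datasets and
benchmarks demonstrate the effectiveness of DiverseEvol. Our models, trained on
less than 8% of the original dataset, maintain or improve performance compared
with fine-tuning on the full data. We also provide empirical evidence analyzing
the importance of diversity in instruction data and of the iterative scheme as
opposed to one-time sampling. Our code is publicly available at this https URL.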