In this paper we consider multi-objective reinforcement learning where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. In the online setting, the agent receives an (adversarial) preference in every episode and proposes a policy to interact with the environment. We provide a model-based algorithm that achieves a regret bound of $\widetilde{\mathcal{O}}\left({\sqrt{\min\{d,S\}\cdot H^3 SAK}}\right)$, where $d$ is the number of objectives, $S$ is the number of states, $A$ is the number of actions, $H$ is the length of the horizon, and $K$ is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate arbitrary preference vectors up to $\epsilon$ error. Our proposed algorithm is provably efficient, with a nearly optimal sample complexity of $\widetilde{\mathcal{O}}\left({\frac{\min\{d,S\}\cdot H^4 SA}{\epsilon^2}}\right)$.
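
To make the reward model concrete, the following is a minimal sketch of how a preference vector turns vector-valued rewards into a scalar reward through an inner product, followed by finite-horizon planning for that preference. It is not the algorithm of the paper: it assumes the transitions are known (the paper treats them as unknown), and the names `scalarize`, `plan`, `P`, and `R`, as well as the toy two-state MDP, are hypothetical choices made only for illustration.

```python
def scalarize(w, r_vec):
    """Scalar reward as the inner product <w, r(x, a)> of a preference
    vector w with a d-dimensional reward vector r_vec."""
    return sum(wi * ri for wi, ri in zip(w, r_vec))


def plan(P, R, w, H):
    """Backward induction on a tabular MDP with KNOWN transitions
    (an assumption made here for illustration only).

    P[s][a] is a list of next-state probabilities,
    R[s][a] is a d-dimensional reward vector,
    w is the preference vector, H is the horizon length.
    Returns the initial-step values and a greedy policy per step.
    """
    S, A = len(P), len(P[0])
    V = [0.0] * S          # value after the final step is zero
    policy = []
    for _ in range(H):     # sweep from the last step back to the first
        Q = [[scalarize(w, R[s][a])
              + sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
        greedy = [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]
        V = [Q[s][greedy[s]] for s in range(S)]
        policy.insert(0, greedy)
    return V, policy


# Hypothetical toy MDP: 2 states, 2 actions, d = 2 objectives.
# Action a moves deterministically to state a and pays 1 on objective a.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]

V_first, pi_first = plan(P, R, w=[1.0, 0.0], H=3)    # cares only about objective 0
V_second, pi_second = plan(P, R, w=[0.0, 1.0], H=3)  # cares only about objective 1
```

In this toy instance the two preferences lead to different greedy policies on the same environment, which is the reason a single exploration phase that serves arbitrary preference vectors is useful.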

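This work proposes a model for multi-objective reinforcement learning based on a Markov decision process. To handle reward functions that are uncertain, it uses an inner product to construct a new measure. It studies both online learning and learning based on preference-free exploration, and proposes an algorithm whose trajectory complexity is nearly optimal.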

Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning