Expert imitation, behavioral diversity, and fairness preferences give rise to preferences in sequential decision making domains that do not decompose additively across time. We introduce the class of convex Markov games that allow general convex preferences over occupancy measures. Despite infinite time horizon and strictly higher generality than Markov games, pure strategy Nash equilibria exist under strict convexity. Furthermore, equilibria can be approximated efficiently by performing gradient descent on an upper bound of exploitability. Our experiments imitate human choices in ultimatum games, reveal novel solutions to the repeated prisoner's dilemma, and find fair solutions in a repeated asymmetric coordination game. In the prisoner's dilemma, our algorithm finds a policy profile that deviates from observed human play only slightly, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
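To make "convex preferences over occupancy measures" concrete, the following is a minimal single-agent sketch, not the paper's algorithm. It assumes a tabular setting with discount factor `gamma` and uses a KL divergence to a reference occupancy as one example of a convex loss (an imitation-style preference). All names (`occupancy_measure`, `kl_loss`, `P`, `pi`, `pi_ref`, `rho0`) and the toy numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def occupancy_measure(P, pi, rho0, gamma):
    """Discounted state-action occupancy measure of a fixed policy.

    P:    transitions, shape (S, A, S), P[s, a, t] = Pr(t | s, a)
    pi:   policy, shape (S, A), pi[s, a] = Pr(a | s)
    rho0: initial state distribution, shape (S,)
    """
    S = P.shape[0]
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve d = (1 - gamma) * rho0 + gamma * P_pi^T d for the state occupancy d.
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * rho0)
    # mu[s, a] = d[s] * pi[s, a]
    return d[:, None] * pi

def kl_loss(mu, mu_ref, eps=1e-12):
    """KL(mu || mu_ref): convex in mu, and not a linear function of mu."""
    return float(np.sum(mu * (np.log(mu + eps) - np.log(mu_ref + eps))))

# Toy 2-state, 2-action example (assumed numbers).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])      # policy being evaluated
pi_ref = np.array([[0.5, 0.5], [0.5, 0.5]])  # stand-in for an "expert" policy
rho0 = np.array([1.0, 0.0])
gamma = 0.9

mu = occupancy_measure(P, pi, rho0, gamma)
mu_ref = occupancy_measure(P, pi_ref, rho0, gamma)
print("total occupancy mass:", mu.sum())
print("KL to reference occupancy:", kl_loss(mu, mu_ref))
```

A standard Markov game objective corresponds to the linear special case, where the loss is an inner product between a reward table and `mu`; replacing it with a convex function such as `kl_loss` gives a preference that no longer decomposes additively across time.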

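This work addresses nonlinear preferences in multi-agent learning by proposing convex Markov games, a framework that admits general convex preferences over state occupancy measures. Experiments show that the algorithm finds fair solutions in a repeated asymmetric coordination game and, in the prisoner's dilemma, finds a policy profile that stays close to observed human play while achieving higher per-player utility.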

Convex Markov Games: A Framework for Fairness, Imitation, and Creativity in Multi-Agent Learning