The agency problem emerges in today's large-scale machine learning tasks,
where the learners are unable to direct content creation or enforce data
collection. In this work, we propose a theoretical framework for aligning
the economic interests of different stakeholders in online learning problems
through contract design. The problem, termed \emph{contractual reinforcement
learning}, naturally arises from the classic model of Markov decision
processes, where a learning principal seeks to optimally influence the agent's
action policy for their common interests through a set of payment rules
contingent on the realization of the next state. For the planning problem, we
design an efficient dynamic programming algorithm to determine the optimal
contracts against a far-sighted agent. For the learning problem, we introduce
a generic design of no-regret learning algorithms to untangle the challenges,
from the robust design of contracts to the balance of exploration and exploitation,
reducing the complexity analysis to the construction of efficient search
algorithms. For several natural classes of problems, we design tailored search
algorithms that provably achieve $\tilde{O}(\sqrt{T})$ regret. We also present
an algorithm with $\tilde{O}(T^{2/3})$ regret for the general problem that
improves on the existing analysis in online contract design under mild technical
assumptions.
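
To make the notion of payment rules contingent on the next state concrete, the following is a minimal one-step sketch, not the dynamic programming algorithm of the paper. It assumes a hypothetical toy instance (the action names, transition probabilities, costs, rewards, and the grid resolution are all invented for illustration), non-negative payments, and an agent that breaks ties in the principal's favor. A coarse grid search stands in for an exact optimization.

```python
from itertools import product

# Hypothetical toy instance (not from the paper): two actions, two next states.
TRANSITION = {"shirk": [0.8, 0.2], "work": [0.2, 0.8]}  # P(next state | action)
COST = {"shirk": 0.0, "work": 0.3}                      # agent's cost of each action
REWARD = [0.0, 1.0]                                     # principal's reward per next state


def agent_utility(action, payment):
    """Expected payment received by the agent minus the cost of the action."""
    return sum(p * w for p, w in zip(TRANSITION[action], payment)) - COST[action]


def principal_utility(action, payment):
    """Expected reward kept by the principal after paying the agent."""
    return sum(p * (r - w) for p, r, w in zip(TRANSITION[action], REWARD, payment))


def best_response(payment, tol=1e-9):
    """Agent's utility-maximizing action; ties go to the principal (assumption)."""
    top = max(agent_utility(a, payment) for a in TRANSITION)
    ties = [a for a in TRANSITION if top - agent_utility(a, payment) <= tol]
    return max(ties, key=lambda a: principal_utility(a, payment))


def grid_search(steps=20):
    """Try every payment vector on a grid over [0, 1] and keep the best one."""
    grid = [i / steps for i in range(steps + 1)]
    best = None
    for payment in product(grid, repeat=len(REWARD)):
        action = best_response(payment)
        value = principal_utility(action, payment)
        if best is None or value > best[0] + 1e-12:
            best = (value, payment, action)
    return best


best_value, best_payment, best_action = grid_search()
```

On this toy instance the search settles on paying 0.5 only in the high-reward state, which is just enough to make the agent indifferent between the two actions, so the agent works and the principal keeps an expected 0.4 instead of the 0.2 obtained with no payment.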

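We use contract design to align the economic interests of different stakeholders in online learning problems, propose a theoretical framework for the agency problem in machine learning, and design an efficient dynamic programming algorithm and no-regret learning algorithms to compute optimal contracts and to balance exploration and exploitation.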

Contractual Reinforcement Learning: Pulling Arms with Invisible Hands

We study repeated first-price auctions and general repeated Bayesian games
between two players, where one player, the learner, employs a no-regret
learning algorithm, and the other player, the optimizer, knowing the learner's
algorithm, strategizes to maximize its own utility. For a commonly used class
of no-regret learning algorithms called mean-based algorithms, we show that (i)
in standard (i.e., full-information) first-price auctions, the optimizer cannot
get more than the Stackelberg utility, a standard benchmark in the
literature, but (ii) in Bayesian first-price auctions, there are instances
where the optimizer can achieve utility much higher than the Stackelberg utility.
On the other hand, Mansour et al. (2022) showed that algorithms from a more
sophisticated class, called no-polytope-swap-regret algorithms, are sufficient to
cap the optimizer's utility at the Stackelberg utility in any repeated Bayesian
game (including Bayesian first-price auctions), and they posed the open question
of whether no-polytope-swap-regret algorithms are necessary to cap the optimizer's
utility. For general Bayesian games, under a reasonable and necessary
condition, we prove that no-polytope-swap-regret algorithms are indeed
necessary to cap the optimizer's utility and thus answer their open question.
For Bayesian first-price auctions, we give a simple improvement of the standard
algorithm for minimizing the polytope swap regret by exploiting the structure
of Bayesian first-price auctions.
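
As a rough illustration of the kind of learner considered here, the sketch below runs Hedge (multiplicative weights), a standard no-regret algorithm that is commonly cited as mean-based, in a repeated first-price auction with a fixed learner value. Everything specific is an assumption made for illustration and is not taken from the paper: the bid grid, the rival's constant bid, the learning rate, the rule that the learner wins only with a strictly higher bid, and the assumption that the learner observes the utility every bid would have earned.

```python
import math


def hedge_probabilities(cumulative, eta):
    """Hedge distribution: weights proportional to exp(eta * cumulative utility)."""
    top = max(cumulative)  # subtract the max for numerical stability
    weights = [math.exp(eta * (c - top)) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def first_price_utility(value, bid, rival_bid):
    """Learner pays its own bid when it wins; strict win rule is an assumption."""
    return value - bid if bid > rival_bid else 0.0


def run_learner(value, bids, rival_bids):
    """Return the learner's bid distribution after observing all rival bids."""
    rounds = len(rival_bids)
    eta = math.sqrt(math.log(len(bids)) / rounds)  # a common learning-rate choice
    cumulative = [0.0] * len(bids)
    for rival in rival_bids:
        # Assumed full feedback: the utility of every candidate bid is observed.
        for i, bid in enumerate(bids):
            cumulative[i] += first_price_utility(value, bid, rival)
    return hedge_probabilities(cumulative, eta)


bids = [0.0, 0.25, 0.5, 0.75, 1.0]
final_probs = run_learner(value=1.0, bids=bids, rival_bids=[0.4] * 200)
```

Against a rival who always bids 0.4, the cumulative utility of bidding 0.5 dominates every other bid on the grid, so the learner ends up placing almost all of its probability on that bid. An optimizer that knows this behavior can choose its own bid sequence to steer the learner, which is the setting the abstract analyzes.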
