Most reinforcement learning algorithms with formal regret guarantees assume
all mistakes are reversible and rely on essentially trying all possible
options. This approach leads to poor outcomes when some mistakes are
irreparable or even catastrophic. We propose a variant of the contextual bandit
problem where the goal is to minimize the chance of catastrophe. Specifically,
we assume that the payoff each round represents the chance of avoiding
catastrophe that round, and try to maximize the product of payoffs (the overall
chance of avoiding catastrophe). To give the agent some chance of success, we
allow a limited number of queries to a mentor and assume a Lipschitz continuous
payoff function. We present an algorithm whose regret and rate of querying the
mentor both approach 0 as the time horizon grows, assuming a continuous 1D
state space and a relatively "simple" payoff function. We also provide a
matching lower bound: without the simplicity assumption, any algorithm either
constantly asks for help or is nearly guaranteed to cause catastrophe. Finally,
we identify the key obstacle to generalizing our algorithm to a
multi-dimensional state space.
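
The multiplicative objective can be made concrete with a small sketch. This is not the algorithm from the abstract: the mentor, the payoff function, and the fixed-radius query rule below are all hypothetical choices, made only to show how the product of payoffs trades off against the number of mentor queries on a 1D state space with a Lipschitz payoff.

```python
import math
import random


def overall_survival(payoffs):
    """Product of per-round payoffs: the chance of avoiding catastrophe in every round."""
    return math.prod(payoffs)


def mentor_action(state):
    """Toy mentor (assumption): the safe action equals the state itself."""
    return state


def payoff(state, action):
    """Toy Lipschitz payoff in [0, 1] (assumption): 1 at the mentor's action, decaying linearly."""
    return max(0.0, 1.0 - 0.5 * abs(action - state))


def run_episode(states, radius):
    """Hypothetical query rule: ask the mentor unless an already-queried state
    lies within `radius`; otherwise reuse the mentor's action from the nearest
    queried state. Returns (overall survival chance, number of queries)."""
    queried = []  # (state, mentor action) pairs seen so far
    payoffs = []
    for s in states:
        nearest = min(queried, key=lambda q: abs(q[0] - s), default=None)
        if nearest is None or abs(nearest[0] - s) > radius:
            a = mentor_action(s)  # costs one query
            queried.append((s, a))
        else:
            a = nearest[1]  # act without help
        payoffs.append(payoff(s, a))
    return overall_survival(payoffs), len(queried)


if __name__ == "__main__":
    rng = random.Random(0)
    states = [rng.random() for _ in range(1000)]
    for radius in (0.0, 0.001, 0.1):
        survival, queries = run_episode(states, radius)
        print(f"radius={radius}: survival={survival:.4f}, queries={queries}")
```

Because the objective is a product, many small per-round losses compound. A larger radius saves queries but lowers the overall chance of avoiding catastrophe, and a fixed-radius rule like this one does not by itself drive both quantities to their ideal values.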

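Assuming that the payoff in each round represents the chance of avoiding catastrophe in that round, we propose a variant of the contextual bandit problem whose goal is to minimize the chance of catastrophe by maximizing the product of payoffs (the overall chance of avoiding catastrophe). We provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows, assuming a continuous 1D state space and a relatively simple payoff function. We also provide a matching lower bound: without the simplicity assumption, any algorithm either constantly asks for help or is nearly guaranteed to cause catastrophe. Finally, we identify the key obstacle to generalizing our algorithm to a multi-dimensional state space.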

Avoiding Catastrophe in Continuous Spaces by Asking for Help

To take advantage of strategy commitment, a useful tactic in game playing, a
leader must learn enough about the follower's payoff function. However, this
gives the follower a chance to provide fake information and influence the final
game outcome. Through a carefully contrived payoff function misreported to the
learning leader, the follower may induce an outcome that benefits him more than
the one reached when he behaves truthfully.
We study the follower's optimal manipulation via such strategic behaviors in
extensive-form games. Followers' different attitudes are taken into account. An
optimistic follower maximizes his true utility among all game outcomes that can
be induced by some payoff function. A pessimistic follower only considers
misreporting payoff functions that induce a unique game outcome. For all the
settings considered in this paper, we characterize all the possible game
outcomes that can be induced successfully. We show that it is polynomial-time
tractable for the follower to find the optimal way of misreporting his private
payoff information. Our work completely resolves the follower's optimal
manipulation problem on extensive-form game trees.
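
The effect of misreporting can be illustrated on the smallest possible game tree: the leader moves once, then the follower responds. The sketch below is not the polynomial-time method from the abstract. It assumes the leader commits only to a pure action, breaks ties by lowest index, and that a misreported payoff function amounts to choosing one reported best response per leader action. It then brute-forces all such reports. All function names and payoff numbers are hypothetical.

```python
from itertools import product


def leader_choice(leader_payoff, responses):
    """Leader's best pure commitment when the follower answers action i with
    responses[i]. Ties go to the lowest index (assumption)."""
    values = [leader_payoff[i][responses[i]] for i in range(len(responses))]
    return max(range(len(values)), key=lambda i: values[i])


def truthful_outcome(leader_payoff, follower_payoff):
    """Outcome when the follower reports his true best response to each leader action."""
    responses = tuple(
        max(range(len(row)), key=lambda j: row[j]) for row in follower_payoff
    )
    i = leader_choice(leader_payoff, responses)
    return follower_payoff[i][responses[i]], (i, responses[i]), responses


def best_manipulation(leader_payoff, follower_payoff):
    """Brute force over every reported response profile; keep the one whose
    induced outcome maximizes the follower's TRUE payoff."""
    n_leader = len(leader_payoff)
    n_follower = len(leader_payoff[0])
    best = None
    for responses in product(range(n_follower), repeat=n_leader):
        i = leader_choice(leader_payoff, responses)
        value = follower_payoff[i][responses[i]]
        if best is None or value > best[0]:
            best = (value, (i, responses[i]), responses)
    return best


if __name__ == "__main__":
    # Rows: leader actions; columns: follower actions (hypothetical numbers).
    leader_payoff = [[5, 1], [4, 0]]
    follower_payoff = [[1, 0], [3, 2]]
    print("truthful:   ", truthful_outcome(leader_payoff, follower_payoff))
    print("manipulated:", best_manipulation(leader_payoff, follower_payoff))
```

In this toy game the truthful follower ends up with payoff 1, while pretending to answer the leader's first action badly steers the leader to the second action and yields payoff 3. The brute force is exponential in the number of leader actions, so it only serves to show what "inducing an outcome" means.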

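Finding the follower's optimal manipulation through misreporting his private payoff information is solvable in polynomial time, for both optimistic and pessimistic followers. This work resolves the problem on extensive-form game trees.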