We study online learning problems in constrained Markov decision processes
(CMDPs) with adversarial losses and stochastic hard constraints. We consider
two scenarios. In the first, we address general CMDPs and design an algorithm
that attains sublinear regret and sublinear cumulative positive constraint
violation. In the second scenario, under the mild assumption that a
policy strictly satisfying the constraints exists and is known to the learner,
we design an algorithm that achieves sublinear regret while ensuring that the
constraints are satisfied at every episode with high probability. To the best
of our knowledge, our work is the first to study CMDPs involving both
adversarial losses and hard constraints. Indeed, previous works either focus on
much weaker soft constraints, which allow positive violations to cancel out
negative ones, or are restricted to stochastic losses. Thus, our algorithms can
deal with general non-stationary environments subject to requirements much
stricter than those manageable with state-of-the-art algorithms. This enables
their adoption in a wider range of real-world applications, from autonomous
driving to online advertising and recommender systems.
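
To illustrate the distinction drawn above between soft and hard constraints, the following is a minimal Python sketch, not the paper's algorithm. It assumes a hypothetical list of per-episode constraint margins, where a positive value means the constraint was violated in that episode and a negative value means it was satisfied with slack; the function names and the numbers are illustrative assumptions.

```python
def soft_violation(margins):
    """Signed cumulative violation: slack in one episode offsets violations in another."""
    return sum(margins)


def positive_violation(margins):
    """Cumulative positive violation: only the positive part of each margin counts,
    so slack can never cancel a violation."""
    return sum(max(m, 0.0) for m in margins)


# Hypothetical per-episode margins (assumption): > 0 means violated, <= 0 means satisfied.
margins = [0.5, -0.5, 0.25, -0.25]

soft = soft_violation(margins)      # violations and slack cancel out
hard = positive_violation(margins)  # every violation is counted

print(f"soft (signed) violation: {soft}")
print(f"positive violation:      {hard}")
```

In this toy sequence the signed sum is zero even though two of the four episodes violate the constraint, whereas the positive-part sum still records both violations. This is why a guarantee on positive violation is the stricter requirement.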

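We study online learning problems in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints. We consider two scenarios. In the first, we design an algorithm for general CMDPs that attains sublinear regret and sublinear cumulative positive constraint violation. In the second, assuming that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while satisfying the constraints at every episode with high probability. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. These algorithms can handle general non-stationary environments subject to requirements much stricter than those existing algorithms can manage, which enables their adoption in a wider range of real-world applications, including autonomous driving, online advertising, and recommender systems.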