We initiate the study of multi-stage episodic reinforcement learning under
adversarial corruptions in both the rewards and the transition probabilities of
the underlying system, extending recent results for the special case of
stochastic bandits. We provide a framework which modifies the aggressive
exploration enjoyed by existing reinforcement learning approaches based on
"optimism in the face of uncertainty", by complementing them with principles
from "action elimination". Importantly, our framework circumvents the major
challenges posed by naively applying action elimination in the RL setting, as
formalized by a lower bound we demonstrate. Our framework yields efficient
algorithms which (a) attain near-optimal regret in the absence of corruptions
and (b) adapt to unknown levels of corruption, enjoying regret guarantees which
degrade gracefully in the total corruption encountered. To showcase the
generality of our approach, we derive results for both tabular settings (where
states and actions are finite) and linear-function-approximation
settings (where the dynamics and rewards admit a linear underlying
representation). Notably, our work provides the first sublinear regret
guarantee which accommodates any deviation from purely i.i.d. transitions in
the bandit-feedback model for episodic reinforcement learning.
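
As a rough illustration of the "action elimination" principle in the stochastic bandit special case mentioned above, the following sketch drops every arm whose upper confidence bound falls below the best lower confidence bound. It is not the paper's algorithm: the function name `active_arms`, the Hoeffding-style radius, and the `corruption_budget` term that widens each interval are all assumptions made for illustration.

```python
import math


def active_arms(means, counts, delta=0.05, corruption_budget=0.0):
    """Return the indices of arms that survive one elimination pass.

    means: empirical mean reward of each arm (rewards assumed in [0, 1]).
    counts: number of pulls of each arm (each must be at least 1).
    corruption_budget: hypothetical bound on total corruption; it widens
        every confidence interval by corruption_budget / n (an assumption
        for illustration, not taken from the abstract).
    """
    k = len(means)
    # Hoeffding-style confidence radius with a union bound over the k arms.
    radii = [
        math.sqrt(math.log(2 * k / delta) / (2 * n)) + corruption_budget / n
        for n in counts
    ]
    # The best lower confidence bound among all arms.
    best_lower = max(m - r for m, r in zip(means, radii))
    # Keep an arm only if its upper bound still reaches that lower bound.
    return [
        i for i, (m, r) in enumerate(zip(means, radii)) if m + r >= best_lower
    ]
```

With few samples the intervals overlap and no arm is dropped. With many samples the clearly worse arm is eliminated, unless the assumed corruption budget is large enough to keep the intervals overlapping.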

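We propose a framework that combines the principles of "optimism in the face of uncertainty" and "action elimination" to handle adversarial (non-stochastic) corruptions, yielding efficient algorithms for multi-stage episodic reinforcement learning.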

Corruption-robust exploration strategies in reinforcement learning

Corruption-robust exploration in episodic reinforcement learning

Learning how to act when there are many available actions in each state is a
challenging task for Reinforcement Learning (RL) agents, especially when many
of the actions are redundant or irrelevant. In such cases, it is sometimes
easier to learn which actions not to take. In this work, we propose the
Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL
algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal
actions. The AEN is trained to predict invalid actions, supervised by an
external elimination signal provided by the environment. Simulations
demonstrate a considerable speedup and added robustness over vanilla DQN in
text-based games with over a thousand discrete actions.
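
The way an elimination signal can restrict a Q-learning agent's choice may be sketched as follows. This is a minimal sketch, not the AE-DQN implementation: `select_action`, the per-action invalidity probabilities `invalid_probs`, and the `threshold` value are hypothetical names and parameters, and the fallback for the case where every action is eliminated is an assumption.

```python
def select_action(q_values, invalid_probs, threshold=0.5):
    """Pick the greedy action among those not predicted to be invalid.

    q_values: estimated Q-value of each action in the current state.
    invalid_probs: predicted probability that each action is invalid
        (standing in for the output of an elimination network).
    threshold: actions with invalid probability above this are masked out
        (a hypothetical parameter for illustration).
    """
    allowed = [a for a, p in enumerate(invalid_probs) if p <= threshold]
    if not allowed:
        # Assumed fallback: if everything is eliminated, consider all actions.
        allowed = list(range(len(q_values)))
    # Greedy choice restricted to the surviving actions.
    return max(allowed, key=lambda a: q_values[a])
```

For example, with `q_values = [1.0, 5.0, 3.0]` and `invalid_probs = [0.1, 0.9, 0.2]`, the highest-valued action (index 1) is masked out, so the function returns index 2.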

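This work proposes AE-DQN, a deep reinforcement learning algorithm that incorporates an Action Elimination Network and uses an elimination signal from the external environment to steer action selection away from sub-optimal actions. The algorithm shows a significant advantage in text-based games.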