We consider the batch (off-line) policy learning problem in the infinite
horizon Markov Decision Process. Motivated by mobile health applications, we
focus on learning a policy that maximizes the long-term average reward. We
propose a doubly robust estimator for the average reward and show that it
achieves semiparametric efficiency. Further, we develop an optimization
algorithm to compute the optimal policy in a parameterized stochastic policy
class. The performance of the estimated policy is measured by the difference
between the optimal average reward in the policy class and the average reward
of the estimated policy, and we establish a finite-sample regret guarantee. The
performance of the method is illustrated by simulation studies and an analysis
of a mobile health study promoting physical activity.
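
The performance measure described above can be written out explicitly. The following is only a sketch in notation that the abstract does not itself fix: $\Pi$ (the parameterized policy class), $\eta(\pi)$ (the long-term average reward of policy $\pi$), $R_t$ (the reward at time $t$), and $\hat{\pi}$ (the estimated policy) are all assumed symbols, and the limit is assumed to exist.

```latex
% Requires amsmath (for \operatorname) and amssymb (for \mathbb).
\[
  \eta(\pi) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T} R_t \right],
  \qquad
  \operatorname{Regret}(\hat{\pi}) \;=\;
  \sup_{\pi \in \Pi} \eta(\pi) \;-\; \eta(\hat{\pi}).
\]
```

Under this reading, a finite-sample regret guarantee is a bound on $\operatorname{Regret}(\hat{\pi})$ that holds for a fixed amount of batch data rather than only asymptotically.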

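This work studies the batch (off-line) policy learning problem in an infinite-horizon Markov decision process, with the goal of learning a policy that maximizes the long-term average reward. It proposes a doubly robust estimator of the average reward that achieves semiparametric efficiency, together with an optimization algorithm for computing the policy. The performance of the estimated policy is illustrated through simulation studies and an analysis of a mobile health study promoting physical activity.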

Batch Policy Learning in Average Reward Markov Decision Processes

We initiate the study of fairness in reinforcement learning, where the
actions of a learning algorithm may affect its environment and future rewards.
Our fairness constraint requires that an algorithm never prefers one action
over another if the long-term (discounted) reward of choosing the latter action
is higher. Our first result is negative: despite the fact that fairness is
consistent with the optimal policy, any learning algorithm satisfying fairness
must take time exponential in the number of states to achieve non-trivial
approximation to the optimal policy. We then provide a provably fair polynomial
time algorithm under an approximate notion of fairness, thus establishing an
exponential gap between exact and approximate fairness.
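
The fairness constraint stated above can be illustrated with a small check. This is a minimal sketch, not the paper's formal definition: it assumes that "preferring" an action means choosing it with strictly higher probability, and that the long-term discounted rewards are available as a table. The names `is_fair`, `action_probs`, `q_values`, and `tol` are hypothetical.

```python
import numpy as np


def is_fair(action_probs, q_values, tol=1e-12):
    """Return True if no action is preferred over a strictly better one.

    action_probs: array of shape (S, A); row s is the policy's action
        distribution in state s (assumed representation).
    q_values: array of shape (S, A); long-term discounted reward of each
        action in each state (assumed to be known for this check).
    """
    action_probs = np.asarray(action_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)

    # better[s, a, b] is True when action b has strictly higher value than a.
    better = q_values[:, None, :] > q_values[:, :, None] + tol
    # preferred[s, a, b] is True when a is chosen strictly more often than b.
    preferred = action_probs[:, :, None] > action_probs[:, None, :] + tol

    # A violation is any state where a is preferred although b is better.
    return not np.any(better & preferred)
```

Under this reading, both the uniformly random policy and the greedy policy with respect to `q_values` pass the check, which matches the abstract's remark that fairness is consistent with the optimal policy.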

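This work studies fairness in reinforcement learning, where an algorithm's actions may affect its environment and future rewards. It proposes a fairness constraint and shows that, although the constraint is consistent with the optimal policy, any learning algorithm satisfying it must take time exponential in the number of states to achieve a non-trivial approximation to the optimal policy. It then gives a polynomial-time algorithm under an approximate notion of fairness, establishing an exponential gap between exact and approximate fairness.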

Fairness in Reinforcement Learning

In this paper we study the online learning problem involving rested and
restless multiarmed bandits with multiple plays. The system consists of a
single player/user and a set of K finite-state discrete-time Markov chains
(arms) with unknown state spaces and statistics. At each time step the player
can play M arms. The objective of the user is to decide at each step which M
of the K arms to play over a sequence of trials so as to maximize its long-term
reward. The restless multiarmed bandit is particularly relevant to the
application of opportunistic spectrum access (OSA), where a (secondary) user
has access to a set of K channels, each with a time-varying condition as a result
of random fading and/or certain primary users' activities.
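
The setting described above can be made concrete with a small simulator. This is a sketch under stated assumptions, not the algorithm studied in the paper: the player uses a naive empirical-mean rule, and the names `simulate`, `transition`, `rewards`, and `restless` are hypothetical. In the restless case every arm changes state at each step; in the rested case only the played arms do.

```python
import numpy as np


def simulate(transition, rewards, M, horizon, restless=True, seed=0):
    """Play M of K Markov arms per step using a naive empirical-mean rule.

    transition: array (K, S, S) of row-stochastic matrices, one per arm.
        Known to the simulator only, never to the player.
    rewards: array (K, S); reward collected when a played arm is in a state.
    Returns the cumulative reward and the number of plays per arm.
    """
    rng = np.random.default_rng(seed)
    K, S, _ = transition.shape
    states = rng.integers(S, size=K)  # random initial state for each arm
    totals = np.zeros(K)              # summed observed reward per arm
    counts = np.zeros(K, dtype=int)   # number of times each arm was played
    cumulative = 0.0

    for _ in range(horizon):
        # Arms never played get an infinite index so each is tried once.
        means = totals / np.maximum(counts, 1)
        index = np.where(counts > 0, means, np.inf)
        played = np.argsort(-index, kind="stable")[:M]

        for k in played:
            r = rewards[k, states[k]]
            totals[k] += r
            counts[k] += 1
            cumulative += r

        # Restless: all arms evolve. Rested: only the played arms evolve.
        moving = range(K) if restless else played
        for k in moving:
            states[k] = rng.choice(S, p=transition[k, states[k]])

    return cumulative, counts
```

A call such as `simulate(transition, rewards, M=2, horizon=1000, restless=False)` gives the rested variant. In an opportunistic spectrum access reading, each arm would stand for a channel and each state for a channel condition.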

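This paper studies the online learning problem involving rested and restless multiarmed bandits with multiple plays. At each time step the user can play M arms, and the goal is to decide at each step which M of the K arms to play so as to maximize the long-term reward over a sequence of trials. The restless multiarmed bandit is particularly relevant to the application of opportunistic spectrum access (OSA).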