AI methods are used in societally important settings, ranging from credit to
employment to housing, and it is crucial to ensure fairness in algorithmic
decision making. Moreover, many settings are dynamic, with
populations responding to sequential decision policies. We introduce the study
of reinforcement learning (RL) with stepwise fairness constraints, requiring
group fairness at each time step. Our focus is on tabular episodic RL, and we
provide learning algorithms with strong theoretical guarantees with regard to
policy optimality and fairness violation. Our framework provides useful tools
to study the impact of fairness constraints in sequential settings and raises
new challenges in RL.
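
As a rough illustration of what a stepwise group-fairness constraint measures in a tabular episodic setting, the sketch below computes a per-step gap between two groups. It assumes demographic parity as the fairness notion, a policy shared by both groups, and group-specific transition tensors; the abstract specifies none of these, and the function name and array layouts are hypothetical.

```python
import numpy as np

def stepwise_parity_gaps(P, mu, pi, favorable_action=1):
    """Per-step demographic-parity gap between two groups (illustrative only).

    P[g]  : (S, A, S) transition tensor for group g (assumed layout)
    mu[g] : (S,) initial state distribution for group g
    pi    : (H, S, A) time-dependent policy shared by both groups
    """
    H = pi.shape[0]
    rates = np.zeros((2, H))
    for g in range(2):
        d = np.asarray(mu[g], dtype=float).copy()
        for h in range(H):
            # Probability that group g receives the favorable action at step h.
            rates[g, h] = d @ pi[h][:, favorable_action]
            # Push the state distribution forward one step:
            # d'(t) = sum_{s,a} d(s) * pi_h(a|s) * P_g(t|s,a)
            d = np.einsum("s,sa,sat->t", d, pi[h], P[g])
    return np.abs(rates[0] - rates[1])
```

A stepwise constraint would then require every entry of the returned vector to stay below a chosen tolerance, rather than only an average over the episode.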

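Summary: This paper addresses fairness in algorithmic decision making when AI methods are used in societally important domains, and introduces a reinforcement learning framework and learning algorithms that enforce group fairness at each time step.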

Reinforcement Learning with Stepwise Fairness Constraints

Off-policy evaluation of sequential decision policies from observational data
is necessary in applications of batch reinforcement learning such as education
and healthcare. In such settings, however, unobserved variables confound
observed actions, rendering exact evaluation of new policies impossible, i.e.,
unidentifiable. We develop a robust approach that estimates sharp bounds on the
(unidentifiable) value of a given policy in an infinite-horizon problem given
data from another policy with unobserved confounding, subject to a sensitivity
model. We consider stationary or baseline unobserved confounding and compute
bounds by optimizing over the set of all stationary state-occupancy ratios that
agree with a new partially identified estimating equation and the sensitivity
model. We prove convergence to the sharp bounds as we collect more confounded
data. Although checking set membership is a linear program, the support
function is given by a difficult nonconvex optimization problem. We develop
approximations based on nonconvex projected gradient descent and demonstrate
the resulting bounds empirically.
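
For readers unfamiliar with the technique named above, here is a minimal, generic sketch of projected gradient descent on a nonconvex objective. The toy bilinear objective, the box constraint, and all names are assumptions chosen for illustration; this is not the paper's support-function problem or its algorithm.

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, step=0.1, iters=500):
    """Repeat a gradient step followed by projection onto the feasible set."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(iters):
        x = project(x - step * grad(x))
    return x

# Toy nonconvex objective f(x) = x[0] * x[1] on the box [-1, 1]^2 (assumed).
def toy_grad(x):
    return np.array([x[1], x[0]])

def project_box(x):
    # Euclidean projection onto the box is coordinate-wise clipping.
    return np.clip(x, -1.0, 1.0)

x_star = projected_gradient_descent(toy_grad, project_box, x0=[0.5, -0.2])
```

Because the objective is nonconvex, the point reached depends on the initialization, which is why such a procedure gives an approximation rather than a guaranteed optimum.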

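Summary: For batch reinforcement learning applications such as education and healthcare, where unobserved variables confound the observed actions, we develop a robust approach that, under a sensitivity model, estimates sharp bounds on the value of a given policy in an infinite-horizon problem using data from another policy. We prove convergence to the sharp bounds as more confounded data are collected. Although checking set membership is a linear program, the support function is given by a difficult nonconvex optimization problem. We develop approximations based on nonconvex projected gradient descent and demonstrate the resulting bounds empirically.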