There is a growing body of work on learning from human feedback to align
various aspects of machine learning systems with human values and preferences.
We consider the setting of fairness in content moderation, in which human
feedback is used to determine how two comments -- referencing different
sensitive attribute groups -- should be treated in comparison to one another.
With a novel dataset collected from Prolific and MTurk, we find significant
gaps in fairness preferences depending on the race, age, political stance,
educational level, and LGBTQ+ identity of annotators. We also demonstrate that
demographics mentioned in text have a strong influence on how users perceive
individual fairness in moderation. Further, we find that these differences also
exist in downstream classifiers trained to predict human preferences. Finally,
we observe that an ensemble giving equal weight to classifiers trained on
annotations from different demographic groups performs better across
demographic intersections than a single classifier that gives equal weight to
each annotation.
