Positive-unlabeled (PU) learning addresses the problem of learning a binary classifier from positive (P) and unlabeled (U) data. It is often applied to situations where negative (N) data are difficult to label fully. However, collecting a non-representative N set that contains only a small portion of all possible N data can be much easier in many practical situations. This paper studies a novel classification framework that incorporates such biased N (bN) data in PU learning. The fact that the training N data are biased also makes our work very different from standard semi-supervised learning. We provide an empirical risk minimization-based method to address this PUbN classification problem. Our approach can be regarded as a variant of traditional example-reweighting algorithms, with the weight of each example computed through a preliminary step that draws inspiration from PU learning. We also derive an estimation error bound for the proposed method. Experimental results demonstrate the effectiveness of our algorithm not only in PUbN learning scenarios but also in ordinary PU learning scenarios on several benchmark datasets.
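
To make the notion of example reweighting concrete, the following is a minimal sketch of a weighted empirical risk with a logistic loss. It illustrates generic example reweighting only and is not the estimator proposed in the paper: the function names `logistic_loss` and `weighted_empirical_risk` are hypothetical, and the weights are assumed to be supplied by some earlier step rather than computed here.

```python
import numpy as np


def logistic_loss(margin):
    # log(1 + exp(-margin)), computed in a numerically stable way.
    return np.logaddexp(0.0, -margin)


def weighted_empirical_risk(scores, labels, weights):
    """Weighted average of per-example logistic losses.

    scores  : real-valued classifier outputs g(x)
    labels  : +1 or -1 for each example
    weights : non-negative per-example weights (assumed given; how they
              are obtained is outside the scope of this sketch)
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    losses = logistic_loss(labels * scores)
    # Normalize by the total weight so the result is a weighted mean.
    return float(np.sum(weights * losses) / np.sum(weights))


if __name__ == "__main__":
    # Toy values chosen only for illustration.
    scores = [2.0, -1.0, 0.5]
    labels = [1, -1, -1]
    weights = [1.0, 1.0, 0.3]
    print(weighted_empirical_risk(scores, labels, weights))
```

With uniform weights the function reduces to the ordinary mean loss; down-weighting an example reduces its influence on the risk that a learner would minimize.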

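This paper proposes a new classification framework for binary classification settings in which the negative data are too diverse to be fully labeled, and introduces an empirical risk minimization-based method to address the problem. The weight of each example used in the method is computed through a preliminary step inspired by PU learning, and an estimation error bound is derived for the proposed method. Experimental results show that the algorithm is effective not only in PUbN learning scenarios but also in ordinary PU learning scenarios on several benchmark datasets.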

Classification from Positive, Unlabeled and Biased Negative Data