Leveraging the wealth of unlabeled data produced in recent years provides great potential for improving supervised models. When the cost of acquiring labels is high, probabilistic active learning methods can be used to greedily select the most informative data points to be labeled. However, for many large-scale problems standard greedy procedures become computationally infeasible and suffer from negligible model change. In this paper, we introduce a novel Bayesian batch active learning approach that mitigates these issues. Our approach is motivated by approximating the complete data posterior of the model parameters. While naive batch construction methods result in correlated queries, our algorithm produces diverse batches that enable efficient active learning at scale. We derive interpretable closed-form solutions akin to existing active learning procedures for linear models, and generalize to arbitrary models using random projections. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.

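This work proposes a Bayesian batch active learning method for large-scale supervised learning settings where labels are expensive to acquire, so that large amounts of unlabeled data can be used to improve model performance. The method approximates the complete data posterior of the model parameters and uses random projections to generalize to arbitrary models. This yields more diverse batch selections and lowers computational cost, and the approach is validated on several large-scale regression and classification tasks.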

Bayesian Batch Active Learning as Sparse Subset Approximation