Paper Title

Generalized Bayes Quantification Learning under Dataset Shift

Paper Authors

Jacob Fiksel, Abhirup Datta, Agbessi Amouzou, Scott Zeger

Paper Abstract

Quantification learning is the task of prevalence estimation for a test population using predictions from a classifier trained on a different population. Quantification methods assume that the sensitivities and specificities of the classifier are either perfect or transportable from the training to the test population. These assumptions are inappropriate in the presence of dataset shift, when the misclassification rates in the training population are not representative of those for the test population. Quantification under dataset shift has been addressed only for single-class (categorical) predictions, and only under the assumption of perfect knowledge of the true labels on a small subset of the test population. We propose generalized Bayes quantification learning (GBQL), which uses the entire compositional predictions from probabilistic classifiers and allows for uncertainty in the true class labels of the limited labeled test data. Instead of positing a full model, we use a model-free Bayesian estimating-equation approach to compositional data based only on a first-moment assumption. The idea is useful in Bayesian compositional data analysis in general, as it is robust to different generating mechanisms for compositional data and includes categorical outputs as a special case. We show how our method yields existing quantification approaches as special cases. We also discuss an extension to an ensemble GBQL that uses predictions from multiple classifiers, yielding inference robust to the inclusion of a poor classifier. We outline a fast and efficient Gibbs sampler using a rounding and coarsening approximation to the loss functions. We also establish posterior consistency, asymptotic normality, and valid coverage of interval estimates from GBQL, as well as a finite-sample posterior concentration rate. Empirical performance of GBQL is demonstrated through simulations and analysis of real data with evident dataset shift.
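To make the estimation target concrete, here is a minimal Python sketch of the moment relation implied by a first-moment assumption on compositional predictions: if the mean prediction given true class c is the c-th row of a misclassification-rate matrix M, then the marginal mean prediction satisfies E[p] = Mᵀπ, where π is the vector of test-population prevalences. The sketch estimates M from a small labeled test subset and solves for π by constrained least squares. It illustrates only this plug-in moment-matching special case, not the paper's generalized Bayes posterior or its Gibbs sampler; all names and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical toy setup (illustrative only): 3 classes, true test prevalences
# pi_true, and row c of M = mean compositional prediction given true class c.
C = 3
pi_true = np.array([0.6, 0.3, 0.1])
M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Unlabeled test set: draw true classes, then noisy compositional predictions
# whose class-conditional means are the rows of M (Dirichlet noise is one choice).
y = rng.choice(C, size=5000, p=pi_true)
preds = np.vstack([rng.dirichlet(20 * M[c]) for c in y])

# Small labeled test subset: estimate M by averaging predictions within each
# observed true class (assumes every class appears in the labeled subset).
y_lab = rng.choice(C, size=200, p=pi_true)
preds_lab = np.vstack([rng.dirichlet(20 * M[c]) for c in y_lab])
M_hat = np.vstack([preds_lab[y_lab == c].mean(axis=0) for c in range(C)])

# Moment relation E[p] = M^T pi: match the average unlabeled prediction by
# least squares over the probability simplex.
p_bar = preds.mean(axis=0)

def moment_loss(pi):
    return np.sum((M_hat.T @ pi - p_bar) ** 2)

res = minimize(moment_loss, np.full(C, 1.0 / C), method="SLSQP",
               bounds=[(0.0, 1.0)] * C,
               constraints={"type": "eq", "fun": lambda pi: pi.sum() - 1.0})
print("estimated prevalences:", np.round(res.x, 3))  # close to pi_true
```

A categorical (single-class) classifier is the special case in which each prediction is a vertex of the simplex, so the same moment relation recovers confusion-matrix-style adjusted-count quantification.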
