Paper Title
Confident Sinkhorn Allocation for Pseudo-Labeling
Paper Authors
Paper Abstract
Semi-supervised learning is a critical tool in reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as images and natural language, by exploiting the inherent spatial and semantic structure therein with pretrained models or data augmentation. These methods are not applicable, however, when the data does not have the appropriate structure or invariances. Due to their simplicity, pseudo-labeling (PL) methods can be widely used without any domain assumptions. However, the greedy mechanism in PL is sensitive to a threshold and can perform poorly if wrong assignments are made due to overconfidence. This paper studies theoretically the role of uncertainty in pseudo-labeling and proposes Confident Sinkhorn Allocation (CSA), which identifies the best pseudo-label allocation via optimal transport, restricted to samples with high confidence scores. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning. Additionally, we propose to use integral probability metrics to extend and improve the existing PAC-Bayes bound, which relies on the Kullback-Leibler (KL) divergence, for ensemble models. Our code is publicly available at https://github.com/amzn/confident-sinkhorn-allocation.
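The core idea the abstract describes, allocating pseudo-labels via optimal transport and keeping only confident assignments, can be illustrated with a minimal sketch. This is not the paper's exact algorithm: the function names, the entropic regularization strength `eps`, the uniform row marginals, and the confidence `threshold` are all illustrative assumptions; it only shows a standard Sinkhorn-Knopp normalization followed by a confidence filter.

```python
import numpy as np

def sinkhorn_allocation(probs, class_ratios, n_iters=200, eps=0.1):
    """Sinkhorn-Knopp scaling of a model's class-probability matrix so that
    column sums match desired class proportions and row sums are uniform.
    NOTE: illustrative sketch only, not the CSA algorithm from the paper."""
    probs = np.asarray(probs, dtype=float)
    n, k = probs.shape
    # Entropic-OT kernel: cost = -log p, so K = exp(-cost / eps) = p ** (1/eps).
    K = np.power(probs + 1e-12, 1.0 / eps)
    r = np.full(n, 1.0 / n)            # uniform row marginals (one unit per sample)
    c = np.asarray(class_ratios, float)  # target class proportions (must sum to 1)
    u, v = np.ones(n), np.ones(k)
    for _ in range(n_iters):           # alternating marginal projections
        u = r / (K @ v)
        v = c / (K.T @ u)
    return (u[:, None] * K) * v[None, :]   # transport plan, shape (n, k)

def confident_pseudo_labels(probs, class_ratios, threshold=0.8):
    """Assign pseudo-labels only where the (row-normalized) allocation
    is confident; `threshold` is an assumed hyperparameter."""
    plan = sinkhorn_allocation(probs, class_ratios)
    conf = plan / plan.sum(axis=1, keepdims=True)  # per-sample label distribution
    labels = conf.argmax(axis=1)
    mask = conf.max(axis=1) >= threshold           # keep confident samples only
    return labels[mask], mask
```

A sharply peaked row (e.g. `[0.9, 0.1]`) survives the filter, while an ambiguous row (e.g. `[0.5, 0.5]`) is left unlabeled, which mirrors the abstract's point that allocation should avoid overconfident assignments on uncertain samples.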