Paper title

Semi-supervised learning objectives as log-likelihoods in a generative model of data curation

Authors

Ganev, Stoil; Aitchison, Laurence

Abstract

We currently do not have an understanding of semi-supervised learning (SSL) objectives such as pseudo-labelling and entropy minimization as log-likelihoods, which precludes the development of e.g. Bayesian SSL. Here, we note that benchmark image datasets such as CIFAR-10 are carefully curated, and we formulate SSL objectives as a log-likelihood in a generative model of data curation that was initially developed to explain the cold-posterior effect (Aitchison 2020). SSL methods, from entropy minimization and pseudo-labelling to state-of-the-art techniques similar to FixMatch, can be understood as lower bounds on our principled log-likelihood. We are thus able to give a proof-of-principle for Bayesian SSL on toy data. Finally, our theory suggests that SSL is effective in part due to the statistical patterns induced by data curation. This provides an explanation of past results which show SSL performs better on clean datasets without any "out of distribution" examples. Confirming these results, we find that SSL gave much larger performance improvements on curated than on uncurated data, using matched curated and uncurated datasets based on Galaxy Zoo 2.
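
The lower-bound claim can be checked numerically with a small sketch. It rests on an assumption not spelled out in the abstract: an unlabelled image enters a curated dataset only if S independent annotators all give the same label, so its log-likelihood is log sum_y p_y^S, where p_y is the classifier's predictive probability for class y. Under that assumed form, keeping only the largest term of the sum gives a pseudo-labelling-style bound, and Jensen's inequality gives a negative-entropy bound. The function names, the example probabilities and the value of S are hypothetical.

```python
import math

def consensus_log_lik(probs, num_annotators):
    """Assumed form: log sum_y p_y^S, the log-probability that S
    independent annotators all choose the same class."""
    return math.log(sum(p ** num_annotators for p in probs))

def pseudo_label_bound(probs, num_annotators):
    """S * log max_y p_y: drop every term of the sum except the largest."""
    return num_annotators * math.log(max(probs))

def neg_entropy_bound(probs, num_annotators):
    """(S - 1) * sum_y p_y log p_y: Jensen's inequality applied to
    log sum_y p_y * p_y^(S - 1)."""
    return (num_annotators - 1) * sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical predictive distribution for one unlabelled image.
probs = [0.7, 0.2, 0.1]
S = 3  # hypothetical number of annotators

exact = consensus_log_lik(probs, S)
print("consensus log-likelihood:", exact)
print("pseudo-label bound:      ", pseudo_label_bound(probs, S))
print("negative-entropy bound:  ", neg_entropy_bound(probs, S))
```

Both bounds sit at or below the consensus log-likelihood for any probability vector, so maximizing either one pushes up a lower bound on that quantity. This only illustrates the inequality under the stated assumption; it is not the paper's full model or training procedure.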
