Paper title
How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm?
Paper authors
Paper abstract
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. Distribution-free upper and lower bounds on the gen-error can also be obtained. Our findings offer the new insight that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and the input training data but also by the information \emph{shared} between the \emph{labeled} and \emph{pseudo-labeled} data samples. This serves as a guideline for choosing an appropriate pseudo-labeling method from a given family of methods. To deepen our understanding, we further explore two examples: mean estimation and logistic regression. In particular, we analyze how the ratio $λ$ of the number of unlabeled to labeled data samples affects the gen-error in both scenarios. As $λ$ increases, the gen-error for mean estimation decreases and then saturates at a value larger than that obtained when all the samples are labeled. Our analysis quantifies this gap \emph{exactly} and shows that it depends on the \emph{cross-covariance} between the labeled and pseudo-labeled data samples. For logistic regression, the gen-error and the variance component of the excess risk also decrease as $λ$ increases.
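To make the setting in the abstract more concrete, the following is a minimal toy sketch of a Gibbs distribution over hypotheses for one-dimensional mean estimation with pseudo-labels. It is not the paper's construction: the function names (`gibbs_posterior`, `mixed_empirical_risk`), the finite hypothesis grid, the uniform prior, the squared-error loss, the sign-based pseudo-labeling rule, and the weighting of the two risks by sample counts are all assumptions made for illustration. Here `gamma` plays the role of an inverse temperature and `lam` is the ratio of unlabeled to labeled samples.

```python
import numpy as np

def gibbs_posterior(risks, prior, gamma):
    """Gibbs distribution on a finite grid: p(w) proportional to prior(w) * exp(-gamma * risk(w))."""
    log_w = np.log(prior) - gamma * np.asarray(risks)
    log_w -= log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def mixed_empirical_risk(w_grid, z_lab, z_pseudo):
    """Squared-error risk averaged over labeled and pseudo-labeled samples.

    Assumption: the two risks are weighted by sample counts, i.e. 1/(1+lam) and lam/(1+lam).
    """
    lam = len(z_pseudo) / len(z_lab)
    risk_lab = ((z_lab[None, :] - w_grid[:, None]) ** 2).mean(axis=1)
    risk_pse = ((z_pseudo[None, :] - w_grid[:, None]) ** 2).mean(axis=1)
    return (risk_lab + lam * risk_pse) / (1.0 + lam)

rng = np.random.default_rng(0)
mu, sigma, n, lam = 1.0, 1.0, 20, 5   # hypothetical toy parameters
m = lam * n                            # number of unlabeled samples

# Labeled data: y in {-1, +1}, x ~ N(y * mu, sigma^2); z = y * x estimates mu.
y_lab = rng.choice([-1.0, 1.0], size=n)
x_lab = y_lab * mu + sigma * rng.normal(size=n)
z_lab = y_lab * x_lab

# Unlabeled data receives pseudo-labels from an initial estimate w0
# built from the labeled samples only (illustrative rule).
w0 = z_lab.mean()
y_unl = rng.choice([-1.0, 1.0], size=m)  # true labels, hidden from the learner
x_unl = y_unl * mu + sigma * rng.normal(size=m)
y_hat = np.sign(w0 * x_unl)
z_pseudo = y_hat * x_unl

# Gibbs distribution over a grid of candidate means with a uniform prior.
w_grid = np.linspace(-3.0, 3.0, 601)
prior = np.full(w_grid.shape, 1.0 / w_grid.size)
risks = mixed_empirical_risk(w_grid, z_lab, z_pseudo)
posterior = gibbs_posterior(risks, prior, gamma=50.0)

pooled_mean = np.concatenate([z_lab, z_pseudo]).mean()
print("posterior mode:", w_grid[posterior.argmax()])
print("pooled sample mean:", pooled_mean)
```

Under these assumptions the mixed risk is quadratic in the hypothesis and is minimized at the pooled mean of the labeled and pseudo-labeled samples, so the posterior mode lands on the nearest grid point. Because the pseudo-labels are computed from `w0`, which itself depends on the labeled samples, the two datasets are statistically dependent, which is the kind of shared information the abstract refers to.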