Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets

Paper Title

Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets

Authors

Hongxin Wei, Lue Tao, Renchunzi Xie, Lei Feng, Bo An

Abstract

Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.
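
The label-sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the complementary distribution, so the version below is an assumption made for illustration: each class receives a weight of `alpha - prior`, with `alpha` defaulting to the largest class prior, and the weights are then normalized. The function names (`class_priors`, `complementary_distribution`, `sample_open_set_labels`) are hypothetical and not taken from the authors' code.

```python
import random
from collections import Counter


def class_priors(labels, num_classes):
    """Empirical class priors of the (long-tailed) training labels."""
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(k, 0) / n for k in range(num_classes)]


def complementary_distribution(priors, alpha=None):
    """Assumed complementary form: weight_k = alpha - prior_k, normalized.

    With alpha = max(priors), the most frequent class gets probability 0
    and rarer classes get proportionally more mass.
    """
    if alpha is None:
        alpha = max(priors)
    weights = [alpha - p for p in priors]
    total = sum(weights)
    if total <= 0:
        # Already balanced: fall back to a uniform distribution.
        return [1.0 / len(priors)] * len(priors)
    return [w / total for w in weights]


def sample_open_set_labels(num_open, dist, seed=0):
    """Draw one noisy label per open-set instance from `dist`."""
    rng = random.Random(seed)
    return rng.choices(range(len(dist)), weights=dist, k=num_open)


# Toy long-tailed label set: class 0 is the head, class 2 is the tail.
train_labels = [0] * 6 + [1] * 3 + [2] * 1
priors = class_priors(train_labels, num_classes=3)
dist = complementary_distribution(priors)
open_labels = sample_open_set_labels(num_open=1000, dist=dist)
```

In this toy setup the head class receives no open-set labels, while the tail class receives the largest share, which is the re-balancing effect on the class priors that the abstract describes. Choosing a larger `alpha` would spread some mass back onto the head class.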
