Paper Title
Efficient SVDD Sampling with Approximation Guarantees for the Decision Boundary
Paper Authors
Paper Abstract
Support Vector Data Description (SVDD) is a popular one-class classifier for anomaly and novelty detection. But despite its effectiveness, SVDD does not scale well with data size. To avoid prohibitive training times, sampling methods select small subsets of the training data on which SVDD trains a decision boundary hopefully equivalent to the one obtained on the full data set. According to the literature, a good sample should therefore contain so-called boundary observations that SVDD would select as support vectors on the full data set. However, non-boundary observations are also essential to avoid fragmenting contiguous inlier regions, which degrades classification accuracy. Other aspects, such as selecting a sufficiently representative sample, are important as well. But existing sampling methods largely overlook them, resulting in poor classification accuracy. In this article, we study how to select a sample with these points in mind. Our approach is to frame SVDD sampling as an optimization problem, where constraints guarantee that sampling indeed approximates the original decision boundary. We then propose RAPID, an efficient algorithm to solve this optimization problem. RAPID does not require any parameter tuning, is easy to implement, and scales well to large data sets. We evaluate our approach on real-world and synthetic data. Our evaluation is the most comprehensive one for SVDD sampling so far. Our results show that RAPID outperforms its competitors in classification accuracy, in sample size, and in runtime.
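The premise of the abstract, that SVDD trained on a small sample can approximate the full-data decision boundary, can be illustrated with a minimal sketch. This is not RAPID itself (the paper's algorithm is not reproduced here); as a stand-in it uses plain uniform random sampling and scikit-learn's `OneClassSVM` (an SVDD-equivalent formulation with an RBF kernel), then measures how often the sample-trained boundary agrees with the full-data one. All parameter values (`nu`, `gamma`, sizes) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic inlier data: a single Gaussian blob in 2-D.
X = rng.normal(size=(2000, 2))

# Baseline: one-class SVM trained on the full data set.
full_model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X)

# Stand-in for a sampling method: a 10% uniform random sample.
# (RAPID would instead pick boundary and representative points.)
idx = rng.choice(len(X), size=200, replace=False)
sample_model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X[idx])

# Compare the two decision boundaries on held-out query points
# drawn slightly wider than the training distribution, so the
# comparison covers both inlier and outlier regions.
queries = rng.normal(size=(1000, 2)) * 1.5
agreement = np.mean(full_model.predict(queries) == sample_model.predict(queries))
print(f"boundary agreement: {agreement:.3f}")
```

Even naive random sampling already yields high agreement on such easy data; the paper's point is that on harder, multi-modal data, unconstrained sampling fragments inlier regions, which is what RAPID's optimization constraints are designed to prevent.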