Title
Bayes Classification using an approximation to the Joint Probability Distribution of the Attributes
Authors
Abstract
The Naive-Bayes classifier is widely used due to its simplicity, speed and accuracy. However, this approach fails when, for at least one attribute value in a test sample, there are no corresponding training samples with that attribute value. This is known as the zero-frequency problem and is typically addressed using Laplace smoothing. However, Laplace smoothing does not take into account the statistical characteristics of the neighbourhood of the attribute values of the test sample. Gaussian Naive Bayes addresses this, but the resulting Gaussian model is formed from global information. We instead propose an approach that estimates conditional probabilities using information in the neighbourhood of the test sample. In this case we no longer need to assume independence of the attribute values, and hence we consider the joint probability distribution conditioned on the given class; this means that our approach (unlike the Gaussian and Laplace approaches) takes into consideration dependencies among the attribute values. We illustrate the performance of the proposed approach on a wide range of datasets taken from the University of California at Irvine (UCI) Machine Learning Repository. We also include results for the $k$-NN classifier and demonstrate that the proposed approach is simple, robust and outperforms standard approaches.
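To make the contrast in the abstract concrete, the following is a minimal sketch of the three conditional-probability estimates being compared: a Laplace-smoothed categorical estimate, a Gaussian Naive Bayes estimate fitted on all class data (global), and a local joint estimate based on training samples near the test point. The abstract does not specify the paper's exact local estimator; the hypersphere count used in `neighbourhood_joint` below is an assumed stand-in for illustration only, and all function names are hypothetical.

```python
import numpy as np

def laplace_conditional(count_xc, count_c, n_values, alpha=1.0):
    # Laplace-smoothed estimate of P(x | c): avoids a zero probability
    # when attribute value x never co-occurs with class c in training.
    # n_values is the number of distinct values the attribute can take.
    return (count_xc + alpha) / (count_c + alpha * n_values)

def gaussian_conditional(x, mu_c, sigma_c):
    # Gaussian Naive Bayes estimate of P(x | c): a normal density whose
    # mean/std are fitted on ALL class-c training values of this
    # attribute, i.e. formed from global information.
    z = (x - mu_c) / sigma_c
    return np.exp(-0.5 * z ** 2) / (sigma_c * np.sqrt(2.0 * np.pi))

def neighbourhood_joint(x, X_c, radius=1.0):
    # Illustrative LOCAL estimate of the joint P(x_1, ..., x_d | c):
    # the fraction of class-c training rows X_c lying within `radius`
    # of the full test vector x. Because it uses the whole attribute
    # vector at once, dependencies among attributes are retained.
    # (Assumed stand-in; not the paper's exact estimator.)
    dists = np.linalg.norm(X_c - x, axis=1)
    return float(np.mean(dists <= radius))

# Example: an attribute value unseen with class c (count_xc = 0) still
# gets a small nonzero probability under Laplace smoothing.
p = laplace_conditional(count_xc=0, count_c=10, n_values=5)  # 1/15
```

Note how the first two estimates treat each attribute independently, while the local estimate operates on the full test vector, which is the distinction the abstract draws.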