Paper Title

Feature Ranking for Semi-supervised Learning

Authors

Matej Petković, Sašo Džeroski, Dragi Kocev

Abstract

The data available for analysis are becoming increasingly complex along several dimensions: high dimensionality, the number of examples, and the number of labels per example. This poses a variety of challenges for existing machine learning methods, which must cope with datasets that contain a large number of examples described in a high-dimensional space, where not all examples have labels. For example, when investigating the toxicity of chemical compounds, many compounds are available and can be described with information-rich, high-dimensional representations, but not all of them have information on their toxicity. To address these challenges, we propose semi-supervised learning of feature rankings. The feature rankings are learned in the context of classification and regression, as well as in the context of structured output prediction (multi-label classification, hierarchical multi-label classification, and multi-target regression). To the best of our knowledge, this is the first work that treats the task of feature ranking within the semi-supervised structured output prediction context. More specifically, we propose two approaches, based on tree ensembles and on the Relief family of algorithms. An extensive evaluation across 38 benchmark datasets reveals the following: (1) Random Forests perform best for the classification-like tasks, while Extra-PCTs perform best for the regression-like tasks; (2) Random Forests are the most efficient method in terms of induction times across all tasks; and (3) semi-supervised feature rankings outperform their supervised counterparts across a majority of the datasets from the different tasks.
