Paper Title

A Hybrid Two-layer Feature Selection Method Using Genetic Algorithm and Elastic Net

Authors

Fatemeh Amini, Guiping Hu

Abstract

Feature selection, a critical pre-processing step in machine learning, aims to identify representative predictors in a high-dimensional feature space in order to improve prediction accuracy. However, as the dimensionality of the feature space grows relative to the number of observations, many existing feature selection methods face severe challenges in both computational efficiency and prediction performance. This paper presents a new hybrid two-layer feature selection approach that combines a wrapper and an embedded method to construct an appropriate subset of predictors. In the first layer, a Genetic Algorithm (GA) is used as a wrapper to search for the optimal subset of predictors, with the aim of reducing both the number of predictors and the prediction error. GA, a meta-heuristic, is chosen for its computational efficiency; however, GAs do not guarantee optimality. To address this, a second layer is added to eliminate any remaining redundant or irrelevant predictors and further improve prediction accuracy. Elastic Net (EN) is chosen as the embedded method in the second layer because of its flexibility in adjusting the penalty terms in the regularization process and its time efficiency. The hybrid two-layer approach is applied to a maize genetic dataset from the NAM population, which consists of multiple subsets with different ratios of the number of predictors to the number of observations. The numerical results confirm the superiority of the proposed model.
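The two-layer idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the fitness function (cross-validated error plus a per-predictor penalty), the GA operators (elitism, uniform crossover, bit-flip mutation), the coordinate-descent Elastic Net solver, the synthetic data, and every hyperparameter value and function name below are assumptions made for the example.

```python
import numpy as np

def cv_mse(X, y, k=3, lam=1e-2):
    """Mean k-fold validation MSE of a lightly ridge-stabilised linear fit."""
    idx = np.arange(len(y))
    errs = []
    for val in np.array_split(idx, k):
        tr = np.setdiff1d(idx, val)
        A = X[tr].T @ X[tr] + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[tr].T @ y[tr])
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errs))

def ga_select(X, y, rng, pop_size=20, generations=15, size_penalty=0.01, p_mut=0.05):
    """Layer 1 (wrapper): evolve boolean masks over the predictors.
    Assumed fitness: CV error plus a penalty on the number of predictors."""
    p = X.shape[1]

    def fitness(mask):
        if not mask.any():
            return np.inf  # an empty subset cannot predict anything
        return cv_mse(X[:, mask], y) + size_penalty * mask.sum()

    pop = rng.random((pop_size, p)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        elite = pop[np.argsort(scores)[: pop_size // 2]]  # keep the better half
        children = []
        while len(children) < pop_size - len(elite):
            a, b = elite[rng.integers(len(elite), size=2)]
            child = np.where(rng.random(p) < 0.5, a, b)   # uniform crossover
            child ^= rng.random(p) < p_mut                # bit-flip mutation
            children.append(child)
        pop = np.vstack([elite, children])
    scores = np.array([fitness(m) for m in pop])
    return pop[np.argmin(scores)]

def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Layer 2 (embedded): coordinate descent on
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # residual y - Xw, with w = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * w[j]                       # drop predictor j's contribution
            rho = X[:, j] @ resid / n
            shrunk = np.sign(rho) * max(abs(rho) - alpha * l1_ratio, 0.0)
            w[j] = shrunk / (col_sq[j] + alpha * (1 - l1_ratio))
            resid -= X[:, j] * w[j]                       # add the updated contribution back
    return w

def two_layer_select(X, y, seed=0):
    """Run the GA wrapper, then keep only predictors with non-zero EN coefficients."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(ga_select(X, y, rng))
    w = elastic_net(X[:, idx], y)
    return idx[w != 0]

# Synthetic example (assumed): 30 predictors, only the first three are informative.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 30))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + 0.1 * rng.standard_normal(60)
selected = two_layer_select(X, y)
print("selected predictor indices:", selected)
```

In this sketch the GA only has to narrow the search to a reasonable candidate subset; the Elastic Net's L1 term then zeroes out predictors the GA kept but that add little, which mirrors the division of labour described in the abstract.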
