Paper Title
On Taking Advantage of Opportunistic Meta-knowledge to Reduce Configuration Spaces for Automated Machine Learning
Paper Authors
Paper Abstract
The automated machine learning (AutoML) process can require searching through complex configuration spaces of not only machine learning (ML) components and their hyperparameters but also ways of composing them together, i.e. forming ML pipelines. Optimisation efficiency and the model accuracy attainable for a fixed time budget suffer if this pipeline configuration space is excessively large. A key research question is whether it is both possible and practical to preemptively avoid costly evaluations of poorly performing ML pipelines by leveraging their historical performance on various ML tasks, i.e. meta-knowledge. This prior experience comes in the form of classifier/regressor accuracy rankings derived from either (1) a substantial but non-exhaustive number of pipeline evaluations made during historical AutoML runs, i.e. 'opportunistic' meta-knowledge, or (2) comprehensive cross-validated evaluations of classifiers/regressors with default hyperparameters, i.e. 'systematic' meta-knowledge. Numerous experiments with the AutoWeka4MCPS package suggest that (1) opportunistic/systematic meta-knowledge can improve ML outcomes, typically in line with how relevant that meta-knowledge is, and (2) configuration-space culling is optimal when it is neither too conservative nor too radical. However, the utility and impact of meta-knowledge depend critically on numerous facets of its generation and exploitation, warranting extensive analysis; these facets are often overlooked or underappreciated within the AutoML and meta-learning literature. In particular, we observe strong sensitivity to the 'challenge' of a dataset, i.e. whether specificity in choosing a predictor leads to significantly better performance. Ultimately, identifying 'difficult' datasets, thus defined, is crucial both to generating informative meta-knowledge bases and to understanding optimal search-space reduction strategies.
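
To make the core mechanism concrete, the sketch below illustrates one way ranking-based culling of a predictor search space could work: rank predictors by their mean accuracy in historical evaluations, then keep only a top fraction of the ranking before the AutoML search begins. This is a minimal illustration under stated assumptions, not the authors' implementation; the classifier names, accuracy values, and the cull_search_space / keep_fraction identifiers are all hypothetical.

# A minimal sketch (not the paper's code) of ranking-based configuration-space
# culling. Classifier names and accuracy numbers are illustrative only.
from statistics import mean

# 'Opportunistic' meta-knowledge: accuracies observed for each classifier
# across pipelines evaluated during historical AutoML runs on prior datasets.
historical_accuracies = {
    "trees.RandomForest": [0.86, 0.84, 0.88],
    "functions.Logistic": [0.81, 0.79],
    "bayes.NaiveBayes":   [0.72, 0.70, 0.74],
    "lazy.IBk":           [0.77, 0.75],
    "rules.ZeroR":        [0.50, 0.52],
}

def cull_search_space(accuracies: dict[str, list[float]],
                      keep_fraction: float) -> list[str]:
    """Rank predictors by mean historical accuracy and keep the top fraction.

    Per the abstract's findings, keep_fraction should be neither too
    conservative (close to 1.0) nor too radical (close to 0.0).
    """
    ranked = sorted(accuracies, key=lambda name: mean(accuracies[name]),
                    reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[:n_keep]

# Keep the top 40% of predictors; the reduced list would then seed the
# pipeline search of an AutoML tool such as AutoWeka4MCPS.
print(cull_search_space(historical_accuracies, keep_fraction=0.4))
# -> ['trees.RandomForest', 'functions.Logistic']

The same culling routine applies to 'systematic' meta-knowledge; only the provenance of the accuracy lists changes, from incidental pipeline evaluations to exhaustive cross-validated runs with default hyperparameters.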