Debunking Generalization Error or: How I Learned to Stop Worrying and Love My Training Set

Paper title

Debunking Generalization Error or: How I Learned to Stop Worrying and Love My Training Set

Authors

Acquaviva, Viviana; Lovell, Christopher; Ishida, Emille

Abstract

We aim to determine some physical properties of distant galaxies (for example, stellar mass, star formation history, or chemical enrichment history) from their observed spectra, using supervised machine learning methods. We know that different astrophysical processes leave their imprint in various regions of the spectra with characteristic signatures. Unfortunately, identifying a training set for this problem is very hard, because labels are not readily available - we have no way of knowing the true history of how galaxies have formed. One possible approach to this problem is to train machine learning models on state-of-the-art cosmological simulations. However, when algorithms are trained on the simulations, it is unclear how well they will perform once applied to real data. In this paper, we attempt to model the generalization error as a function of an appropriate measure of distance between the source domain and the application domain. Our goal is to obtain a reliable estimate of how a model trained on simulations might behave on data.
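
The idea of modeling generalization error as a function of domain distance can be illustrated with a toy sketch. This is not the authors' method: the synthetic "domains", the deliberately simple linear model, the mean-shift distance, and every function name below are assumptions chosen only for illustration. A model is fit on a source domain, evaluated on progressively shifted target domains, and a low-order polynomial is fit to error versus distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(shift, n=2000):
    """Toy stand-in for spectra: 5 features with a nonlinear label.
    `shift` moves the feature distribution away from the source domain."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)
    return X, y

def fit_linear(X, y):
    """Ordinary least squares with an intercept (a deliberately simple model)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def rmse(w, X, y):
    """Root-mean-square error of the linear model `w` on (X, y)."""
    pred = np.column_stack([X, np.ones(len(X))]) @ w
    return float(np.sqrt(np.mean((pred - y) ** 2)))

def domain_distance(Xa, Xb):
    """Hypothetical distance: Euclidean distance between feature means.
    A real application would need a more careful measure."""
    return float(np.linalg.norm(Xa.mean(axis=0) - Xb.mean(axis=0)))

# Train on the "simulation" (source) domain.
X_src, y_src = make_domain(shift=0.0)
w = fit_linear(X_src, y_src)

# Evaluate on target domains that drift away from the source.
shifts = [0.0, 0.5, 1.0, 1.5, 2.0]
distances, errors = [], []
for s in shifts:
    X_tgt, y_tgt = make_domain(shift=s)
    distances.append(domain_distance(X_src, X_tgt))
    errors.append(rmse(w, X_tgt, y_tgt))

# Model the error as a quadratic function of distance.
coeffs = np.polyfit(distances, errors, deg=2)
predicted_error = np.polyval(coeffs, 3.0)  # estimate at an unseen distance
```

In this toy setting the target labels are known, so the error curve can be measured directly. With real observations the labels are unavailable, which is why a fitted error-versus-distance relation would be useful: the distance can be computed from unlabeled data alone.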
