Paper title
Minimax rate of consistency for linear models with missing values
Paper authors
Abstract
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failures, unanswered questions in surveys, etc.). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively studied linear model, but in the presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires solving a number of learning tasks, exponential in the number of input features, which makes prediction impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-squares-type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, and derive associated adaptive risk bounds that turn out to be minimax optimal. Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values.
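To make the pattern-wise decomposition mentioned in the abstract concrete, here is a sketch of the standard identity it relies on; the notation ($M$, $X_{\mathrm{obs}(m)}$, $f^\star_m$) is introduced here for illustration and is not taken from the paper. Writing $M \in \{0,1\}^d$ for the missing pattern of the $d$ input features and $X_{\mathrm{obs}(m)}$ for the coordinates observed under pattern $m$, conditioning on the pattern gives
\[
f^\star\big(X_{\mathrm{obs}(M)}, M\big) = \mathbb{E}\big[Y \mid X_{\mathrm{obs}(M)}, M\big] = \sum_{m \in \{0,1\}^d} \mathbb{1}\{M = m\}\, f^\star_m\big(X_{\mathrm{obs}(m)}\big), \qquad f^\star_m(x) = \mathbb{E}\big[Y \mid X_{\mathrm{obs}(m)} = x,\, M = m\big],
\]
so that learning the Bayes predictor amounts to learning one sub-predictor $f^\star_m$ per missing pattern, i.e. up to $2^d$ distinct regression tasks, which is the exponential blow-up the abstract refers to.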