Paper title
Low-rank matrix denoising for count data using unbiased Kullback-Leibler risk estimation
Paper authors
Paper abstract
Many statistical studies analyze observations organized as a matrix whose elements are count data. When these observations are assumed to follow a Poisson or a multinomial distribution, it is of interest to estimate either the intensity matrix (Poisson case) or the compositional matrix (multinomial case) under the assumption that it has a low-rank structure. In this setting, it is proposed to construct an estimator that minimizes the negative log-likelihood regularized by a nuclear norm penalty. Such an approach easily yields a low-rank matrix-valued estimator with positive entries, which belongs to the set of row-stochastic matrices in the multinomial case. Then, as a main contribution, a data-driven procedure is constructed to select the regularization parameter of such estimators by minimizing (approximately) unbiased estimates of the Kullback-Leibler (KL) risk in these models, which generalize Stein's unbiased risk estimation, originally proposed for Gaussian data. The evaluation of these quantities is a delicate problem, and novel methods are introduced to obtain accurate numerical approximations of such unbiased estimates. Simulated data are used to validate this approach to selecting the regularization parameter for low-rank matrix estimation from count data. For data following a multinomial distribution, the performance of this approach is also compared to $K$-fold cross-validation. Examples from a survey study and from metagenomics also illustrate the benefits of this methodology for real data analysis.
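To make the nuclear-norm-penalized estimator described in the abstract more concrete, the following is a minimal sketch for the Poisson case, not the authors' implementation. It assumes a log link (the intensity matrix is written as exp(Z), which keeps its entries positive), a plain proximal gradient loop with a fixed, untuned step size, and hypothetical function names `svt` and `poisson_nuclear_fit`. The only standard fact it relies on is that the proximal operator of the nuclear norm is soft-thresholding of the singular values. The KL risk estimate used in the paper to choose the regularization parameter is not reproduced here.

```python
import numpy as np


def svt(Z, tau):
    """Soft-threshold the singular values of Z by tau.

    This is the proximal operator of tau * (nuclear norm).
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)
    return (U * s_thr) @ Vt


def poisson_nuclear_fit(Y, lam, n_iter=200):
    """Hypothetical sketch: proximal gradient on
    sum(exp(Z) - Y * Z) + lam * ||Z||_*  (log link is an assumption).

    Returns exp(Z), an entrywise positive estimate of the intensity matrix.
    """
    Y = np.asarray(Y, dtype=float)
    # Assumed initialisation: log of the counts, with zeros lifted to 0.5.
    Z = np.log(np.maximum(Y, 0.5))
    # Assumed fixed step size; a line search would be more robust.
    step = 1.0 / (Y.max() + 1.0)
    for _ in range(n_iter):
        grad = np.exp(Z) - Y  # gradient of the Poisson negative log-likelihood in Z
        Z = svt(Z - step * grad, step * lam)
    return np.exp(Z)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated rank-one intensity matrix and Poisson counts.
    X_true = np.outer(rng.uniform(1.0, 3.0, size=20), rng.uniform(1.0, 3.0, size=10))
    Y = rng.poisson(X_true)
    X_hat = poisson_nuclear_fit(Y, lam=1.0)
    print(X_hat.shape, float(X_hat.min()))
```

In a full procedure, `poisson_nuclear_fit` would be run over a grid of `lam` values and the value minimizing a risk estimate would be retained; here `lam` is simply fixed by hand.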