Paper Title
Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators with Massive Data
Authors
Abstract
Nonuniform subsampling methods are effective at reducing the computational burden while maintaining estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement because of its high computational efficiency. If the data volume is so large that nonuniform subsampling probabilities cannot be calculated all at once, then subsampling with replacement is infeasible to implement. This paper solves this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To handle the case where the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data. Asymptotic properties of the resultant aggregated estimator are investigated. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real data sets.
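
To make the idea of Poisson subsampling over data blocks concrete, here is a minimal Python sketch. It is not the paper's algorithm. The abstract does not state the A- or L-optimal probability formulas, so the sketch assumes a simple placeholder rule (a mix of uniform and row-norm-proportional probabilities), assumes a linear model fitted by inverse-probability-weighted least squares, and assumes that block results are pooled by summing weighted normal-equation terms. All function names and parameters (`sampling_probs`, `poisson_block_stats`, `distributed_poisson_wls`, `expected_size`, `norm_scale`) are hypothetical.

```python
import numpy as np


def sampling_probs(X, expected_size, total_rows, norm_scale):
    """Placeholder inclusion probabilities (an assumption, not the paper's
    optimal ones): half uniform, half proportional to the row norm."""
    norms = np.linalg.norm(X, axis=1)
    pi = 0.5 / total_rows + 0.5 * norms / norm_scale
    # Poisson subsampling needs a per-row probability in [0, 1].
    return np.minimum(1.0, expected_size * pi)


def poisson_block_stats(X, y, probs, rng):
    """Keep each row independently with its own probability and return the
    inverse-probability-weighted normal-equation terms for this block."""
    keep = rng.random(len(y)) < probs
    Xs, ys, w = X[keep], y[keep], 1.0 / probs[keep]
    Xw = Xs * w[:, None]
    return Xw.T @ Xs, Xw.T @ ys, int(keep.sum())


def distributed_poisson_wls(blocks, expected_size, rng):
    """Process blocks one at a time and pool their weighted statistics.

    The normalizing constant is estimated from the first block only, so no
    pass over the full data is needed before sampling (an assumed pilot step).
    """
    total_rows = sum(len(y) for _, y in blocks)
    X0 = blocks[0][0]
    norm_scale = total_rows * np.linalg.norm(X0, axis=1).mean()
    d = X0.shape[1]
    A, b, used = np.zeros((d, d)), np.zeros(d), 0
    for X, y in blocks:
        probs = sampling_probs(X, expected_size, total_rows, norm_scale)
        A_k, b_k, n_k = poisson_block_stats(X, y, probs, rng)
        A += A_k
        b += b_k
        used += n_k
    # Solve the pooled weighted normal equations for the coefficient vector.
    return np.linalg.solve(A, b), used


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(20000, 3))
    y = X @ beta + rng.normal(size=20000)
    blocks = [(X[i::4], y[i::4]) for i in range(4)]
    estimate, used = distributed_poisson_wls(blocks, 2000, rng)
    print("rows used:", used)
    print("estimate:", estimate)
```

Because each row is kept or dropped independently, a block can be sampled without knowing the probabilities of rows in other blocks. Sampling with replacement, by contrast, requires all probabilities to be available and normalized at once. The realized subsample size is random and only equals `expected_size` in expectation.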