Paper Title
Distributed Stochastic Algorithms for High-rate Streaming Principal Component Analysis
Paper Authors
Paper Abstract
This paper considers the problem of estimating the principal eigenvector of a covariance matrix from independent and identically distributed data samples in streaming settings. The streaming rate of data in many contemporary applications can be high enough that a single processor cannot finish an iteration of existing methods for eigenvector estimation before a new sample arrives. This paper formulates and analyzes a distributed variant of the classical Krasulina method (D-Krasulina) that can keep up with the high streaming rate of data by distributing the computational load across multiple processing nodes. The analysis shows that---under appropriate conditions---D-Krasulina converges to the principal eigenvector in an order-wise optimal manner; i.e., after receiving $M$ samples across all nodes, its estimation error can be $O(1/M)$. In order to reduce the network communication overhead, the paper also develops and analyzes a mini-batch extension of D-Krasulina, which is termed DM-Krasulina. The analysis of DM-Krasulina shows that it can also achieve order-optimal estimation error rates under appropriate conditions, even when some samples have to be discarded within the network due to communication latency. Finally, experiments are performed on synthetic and real-world data to validate the convergence behaviors of D-Krasulina and DM-Krasulina in high-rate streaming settings.
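To make the abstract's terms concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of the classical Krasulina iteration and a mini-batch variant in the spirit of DM-Krasulina: each sample contributes the Krasulina search direction at the current iterate, directions within a batch are averaged (standing in for averaging across processing nodes), and the step size decays as $O(1/t)$, matching the $O(1/M)$ error-rate regime. The function names, batch layout, and per-step renormalization are assumptions made for this sketch.

```python
import numpy as np

def krasulina_direction(w, x):
    """Krasulina search direction for sample x at iterate w:
    (x^T w) x - ((x^T w)^2 / ||w||^2) w,
    i.e., the component of (x^T w) x orthogonal to w."""
    xw = x @ w
    return xw * x - (xw ** 2 / (w @ w)) * w

def minibatch_krasulina(batches, dim, gamma0=1.0, seed=0):
    """Sketch of a mini-batch Krasulina iteration.

    `batches` is an iterable of (batch_size, dim) arrays; averaging the
    directions within a batch mimics averaging updates computed in
    parallel across nodes. Step size gamma0 / t gives the O(1/t) decay
    used in the order-optimal analysis. Renormalizing w each step is a
    numerical convenience for this sketch, not part of the raw update.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for t, batch in enumerate(batches, start=1):
        gamma = gamma0 / t
        direction = np.mean([krasulina_direction(w, x) for x in batch], axis=0)
        w = w + gamma * direction
        w /= np.linalg.norm(w)  # keep the iterate on the unit sphere
    return w

# Example: samples with covariance diag(9, 1, 1, 1), so the principal
# eigenvector is e1; the estimate should align with it (up to sign).
rng = np.random.default_rng(1)
scales = np.array([3.0, 1.0, 1.0, 1.0])
batches = [rng.standard_normal((8, 4)) * scales for _ in range(300)]
w_hat = minibatch_krasulina(batches, dim=4)
```

Setting the batch size to 1 recovers the single-sample Krasulina update; larger batches trade fewer communication rounds for the risk of discarding late samples, which is the trade-off DM-Krasulina's analysis quantifies.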