Paper Title
An Improved Analysis of Gradient Tracking for Decentralized Machine Learning
Paper Authors
Paper Abstract
We consider decentralized machine learning over a network where the training data is distributed across $n$ agents, each of which can compute stochastic model updates on its local data. The agents' common goal is to find a model that minimizes the average of all local loss functions. While gradient tracking (GT) algorithms can overcome a key challenge, namely accounting for differences between workers' local data distributions, the known convergence rates for GT algorithms are not optimal with respect to their dependence on the mixing parameter $p$ (related to the spectral gap of the connectivity matrix). We provide a tighter analysis of the GT method in the stochastic strongly convex, convex, and non-convex settings. We improve the dependency on $p$ from $\mathcal{O}(p^{-2})$ to $\mathcal{O}(p^{-1}c^{-1})$ in the noiseless case and from $\mathcal{O}(p^{-3/2})$ to $\mathcal{O}(p^{-1/2}c^{-1})$ in the general stochastic case, where $c \geq p$ is related to the negative eigenvalues of the connectivity matrix (and is a constant in most practical applications). This improvement was made possible by a new proof technique which could be of independent interest.
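The abstract refers to the standard gradient-tracking (GT) iteration from the decentralized optimization literature, in which each agent gossip-averages both its model and a gradient-tracking variable with its neighbors. The following is a minimal sketch of that iteration on synthetic quadratic local losses; the problem data, the step size `eta`, the iteration count `T`, and the ring-topology mixing matrix `W` are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

# Minimal gradient-tracking (GT) sketch: n agents minimize the average of
# local quadratics f_i(x) = 0.5 * ||A_i x - b_i||^2 over a ring topology.
rng = np.random.default_rng(0)
n, d = 8, 5                       # number of agents, problem dimension (assumed)
A = rng.standard_normal((n, d, d))
b = rng.standard_normal((n, d))

def local_grad(i, x):
    """Gradient of agent i's local loss at x."""
    return A[i].T @ (A[i] @ x - b[i])

# Doubly stochastic mixing matrix W for a ring: self-weight 1/2, neighbors 1/4.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eta, T = 0.02, 2000               # step size and number of iterations (assumed)
x = np.zeros((n, d))              # x[i] is agent i's model
g = np.array([local_grad(i, x[i]) for i in range(n)])
y = g.copy()                      # y[i] tracks the network-average gradient

for _ in range(T):
    x_new = W @ (x - eta * y)     # gossip-average the gradient-corrected iterates
    g_new = np.array([local_grad(i, x_new[i]) for i in range(n)])
    y = W @ y + g_new - g         # tracking step preserves mean(y) = mean of local grads
    x, g = x_new, g_new

# All agents should (approximately) agree and minimize the average loss.
x_star = np.linalg.solve(sum(A[i].T @ A[i] for i in range(n)),
                         sum(A[i].T @ b[i] for i in range(n)))
print("consensus error:", np.max(np.linalg.norm(x - x.mean(0), axis=1)))
print("distance to optimum:", np.linalg.norm(x.mean(0) - x_star))
```

In this sketch the mixing parameter $p$ discussed in the abstract corresponds to the spectral gap of `W`: a sparser or larger ring has a smaller gap, which is exactly the regime where the improved $p$-dependence matters.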