Paper Title
Scaling up Stochastic Gradient Descent for Non-convex Optimisation
Paper Authors
Paper Abstract
Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions and large datasets. We address the bottleneck problem that arises when using both shared and distributed memory. Typically, the former is bounded by limited computation resources and bandwidth, whereas the latter suffers from communication overheads. We propose a unified distributed and parallel implementation of SGD (named DPSGD) that relies on both asynchronous distribution and lock-free parallelism. By combining the two strategies into a unified framework, DPSGD is able to strike a better trade-off between local computation and communication. The convergence properties of DPSGD are studied for non-convex problems such as those arising in statistical modelling and machine learning. Our theoretical analysis shows that DPSGD leads to speed-up with respect to the number of cores and the number of workers while guaranteeing an asymptotic convergence rate of $O(1/\sqrt{T})$, given that the number of cores is bounded by $T^{1/4}$ and the number of workers is bounded by $T^{1/2}$, where $T$ is the number of iterations. The potential gains that can be achieved by DPSGD are demonstrated empirically on a stochastic variational inference problem (Latent Dirichlet Allocation) and on a deep reinforcement learning (DRL) problem (Advantage Actor-Critic, A2C), resulting in two algorithms: DPSVI and HSA2C. Empirical results validate our theoretical findings. Comparative studies are conducted to show the performance of the proposed DPSGD against state-of-the-art DRL algorithms.
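To make the pattern concrete, below is a minimal Python sketch of the shared-memory/distributed structure the abstract describes: each worker copies the current global parameters, several threads update that local copy without locks, and the accumulated updates are merged back into the global parameters. The toy least-squares objective and all names (num_workers, num_cores, local_steps, worker_update) are illustrative assumptions, not the paper's implementation or notation; the actual DPSGD algorithm communicates asynchronously rather than in the sequential rounds used here for brevity.

```python
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool  # thread pool: shared memory, no locks taken below

# Toy data: linear regression, so the stochastic gradients are cheap and easy to check.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=1024)

def grad(w, idx):
    """Mini-batch gradient of the least-squares loss at w."""
    xb, yb = X[idx], y[idx]
    return xb.T @ (xb @ w - yb) / len(idx)

num_workers, num_cores, local_steps, lr = 4, 2, 25, 0.05  # illustrative settings
w_global = np.zeros(10)  # plays the role of the parameter server's state

def worker_update(w_snapshot):
    """One worker: copy the (possibly stale) global parameters, let several threads
    apply lock-free SGD steps to the shared local copy, then return the net update."""
    w_local = w_snapshot.copy()
    def core_loop(seed):
        local_rng = np.random.default_rng(seed)
        for _ in range(local_steps):
            idx = local_rng.integers(0, len(X), size=32)
            # in-place, unsynchronised write to the shared local parameters
            w_local[:] = w_local - lr * grad(w_local, idx)
    with ThreadPool(num_cores) as pool:
        pool.map(core_loop, range(num_cores))
    return w_local - w_snapshot

for communication_round in range(20):
    # the real algorithm aggregates these updates asynchronously; sequential rounds keep the sketch short
    updates = [worker_update(w_global) for _ in range(num_workers)]
    for delta in updates:
        w_global += delta / num_workers

print("parameter error:", np.linalg.norm(w_global - w_true))
```

The split mirrors the trade-off discussed in the abstract: more local_steps per round reduces communication but increases the staleness of the snapshot each worker starts from.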