Paper Title
Improving Neural Network Training in Low Dimensional Random Bases
Paper Authors
Paper Abstract
Stochastic Gradient Descent (SGD) has proven to be remarkably effective in optimizing deep neural networks that employ ever-larger numbers of parameters. Yet, improving the efficiency of large-scale optimization remains a vital and highly active area of research. Recent work has shown that deep neural networks can be optimized in randomly-projected subspaces of much smaller dimensionality than their native parameter space. While such training is promising for more efficient and scalable optimization schemes, its practical application is limited by inferior optimization performance. Here, we improve on recent random subspace approaches as follows: Firstly, we show that keeping the random projection fixed throughout training is detrimental to optimization. We propose re-drawing the random subspace at each step, which yields significantly better performance. We realize further improvements by applying independent projections to different parts of the network, making the approximation more efficient as network dimensionality grows. To implement these experiments, we leverage hardware-accelerated pseudo-random number generation to construct the random projections on-demand at every optimization step, allowing us to distribute the computation of independent random directions across multiple workers with shared random seeds. This yields significant reductions in memory and is up to 10 times faster for the workloads in question.
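As a rough illustration of the re-drawn random subspace update described above, the following NumPy sketch minimizes a toy quadratic by projecting the gradient into a low-dimensional subspace that is regenerated from a per-step seed. The objective, dimensions, and learning rate are illustrative assumptions, not the paper's actual experimental setup or implementation.

```python
# Minimal sketch (not the paper's code) of re-drawn random subspace descent:
# each step restricts the update to a low-dimensional random subspace that is
# regenerated from a per-step seed, so the projection never has to be stored.
import numpy as np

D = 200          # native parameter dimension
d = 20           # subspace dimension (d << D)
lr = 0.1         # step size for the toy problem
base_seed = 42   # a shared seed lets independent workers re-create P locally

# Toy objective: L(theta) = 0.5 * ||theta - target||^2, so grad(theta) = theta - target.
target = np.random.default_rng(0).standard_normal(D)
theta = np.zeros(D)

def grad(theta):
    return theta - target

for step in range(300):
    # Re-draw the random projection from a seed derived from the step index.
    # Entries are scaled so that E[P @ P.T] is the identity.
    P = np.random.default_rng(base_seed + step).standard_normal((D, d)) / np.sqrt(d)

    # Gradient with respect to the subspace coordinates c, where
    # theta_new = theta + P @ c and c = 0 at the current point.
    g_low = P.T @ grad(theta)

    # Map the low-dimensional step back into the full parameter space.
    theta -= lr * (P @ g_low)

print("final loss:", 0.5 * np.sum((theta - target) ** 2))
```

Because each projection is fully determined by its seed, workers that share the seed can reconstruct the same random directions on demand instead of storing or communicating them, which is the property the abstract exploits for memory-efficient distributed training.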