Paper Title

Scalable and Practical Natural Gradient for Large-Scale Deep Learning

Paper Authors

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Chuan-Sheng Foo, Rio Yokota

Paper Abstract

Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or ad hoc modifications of batch normalization. We propose Scalable and Practical Natural Gradient Descent (SP-NGD), a principled approach for training models that allows them to attain similar generalization performance to models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with a negligible computational overhead as compared to first-order methods. We evaluated SP-NGD on a benchmark task where highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
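For context, natural gradient descent preconditions the loss gradient with the inverse of the Fisher information matrix. A minimal sketch of the generic update (standard background, not the paper's specific approximation scheme) is:

\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta \mathcal{L}(\theta_t),
\qquad
F(\theta) = \mathbb{E}\big[\nabla_\theta \log p(y \mid x; \theta)\, \nabla_\theta \log p(y \mid x; \theta)^{\top}\big],

where \eta is the learning rate and F(\theta) is the Fisher information matrix. Computing and inverting F exactly is intractable for deep networks with millions of parameters, which is why scalable, practical approximations such as the one proposed here are needed.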
