Paper Title
Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks
Paper Authors
Paper Abstract
Natural Gradient Descent (NGD) helps to accelerate the convergence of gradient descent dynamics, but it requires approximations in large-scale deep neural networks because of its high computational cost. Empirical studies have confirmed that some NGD methods with approximate Fisher information converge sufficiently fast in practice. Nevertheless, it remains unclear from a theoretical perspective why and under what conditions such heuristic approximations work well. In this work, we reveal that, under specific conditions, NGD with approximate Fisher information achieves the same fast convergence to global minima as exact NGD. We consider deep neural networks in the infinite-width limit and analyze the asymptotic training dynamics of NGD in function space via the neural tangent kernel. In function space, the training dynamics with approximate Fisher information are identical to those with exact Fisher information, and they converge quickly. The fast convergence holds for layer-wise approximations, for instance, the block-diagonal approximation in which each block corresponds to a layer, as well as the block tri-diagonal and K-FAC approximations. We also find that a unit-wise approximation achieves the same fast convergence under some assumptions. All of these different approximations have an isotropic gradient in function space, and this plays a fundamental role in achieving the same convergence properties in training. Thus, the current study gives a novel and unified theoretical foundation with which to understand NGD methods in deep learning.
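
To make the function-space argument in the abstract concrete, the following LaTeX sketch works out the exact-NGD update for a squared (MSE) loss in the linearized (NTK) regime. The setting and notation here (Jacobian J, Fisher F, learning rate η, a pseudo-inverse in place of F^{-1}, and a full-row-rank Jacobian as expected in the over-parameterized wide regime) are illustrative assumptions for this sketch, not the paper's exact formulation.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Sketch: exact NGD with MSE loss in the linearized (NTK) regime.
% J is the n-by-P Jacobian of the network outputs on the n training inputs X,
% F is the Fisher information (Gauss-Newton form for a Gaussian output model),
% and F^{+} denotes a pseudo-inverse, used here in place of F^{-1}.
\begin{align*}
  \theta_{t+1} &= \theta_t - \eta\, F^{+}\, \nabla_\theta \mathcal{L}(\theta_t),
  \qquad F = \tfrac{1}{n}\, J^\top J,
  \qquad \nabla_\theta \mathcal{L}(\theta_t) = \tfrac{1}{n}\, J^\top\!\bigl(f_t(X) - y\bigr), \\[2pt]
  % Linearize the network around \theta_t (exact in the infinite-width limit)
  % and use J (J^\top J)^{+} J^\top = I_n when J has full row rank:
  f_{t+1}(X) - f_t(X) &\approx J\,(\theta_{t+1} - \theta_t)
  = -\eta\, J\, (J^\top J)^{+} J^\top \bigl(f_t(X) - y\bigr)
  = -\eta\,\bigl(f_t(X) - y\bigr).
\end{align*}
\end{document}

In this sketch the residual on every training point shrinks by the same factor (1 - η): the function-space gradient is isotropic, and with η = 1 the linearized dynamics reach the global minimum in a single step. The abstract's claim is that the layer-wise (block-diagonal), block tri-diagonal, K-FAC, and unit-wise approximations of F reproduce this same isotropic function-space gradient in the infinite-width limit, which is why they attain the same fast convergence as exact NGD.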