Paper Title

Collegial Ensembles

Paper Authors

Etai Littwin, Ben Myara, Sima Sabah, Joshua Susskind, Shuangfei Zhai, Oren Golan

Paper Abstract

Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that the improvement with size increase is a product of a better conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a single model. We show that the optimization dynamics of CE simplify dramatically when the number of models in the ensemble is large, resembling the dynamics of wide models, yet scale much more favorably. We use recent theoretical results on the finite width corrections of the NTK to perform efficient architecture search in a space of finite width CE that aims to either minimize capacity, or maximize trainability under a set of constraints. The resulting ensembles can be efficiently implemented in practical architectures using group convolutions and block diagonal layers. Finally, we show how our framework can be used to analytically derive optimal group convolution modules originally found using expensive grid searches, without having to train a single model.
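To make the group-convolution construction concrete, below is a minimal PyTorch sketch, not taken from the paper: the module and parameter names (`CollegialEnsembleBlock`, `width`, `m`) are illustrative assumptions. It shows how an ensemble of m identical branches can be packed into a single model using grouped (block-diagonal) layers, with the member outputs aggregated by averaging.

```python
# Illustrative sketch (not the authors' code): a collegial ensemble (CE) of m
# identical convolutional branches, implemented with group convolutions so the
# members' weights stay block diagonal and the ensemble trains as one model.
import torch
import torch.nn as nn


class CollegialEnsembleBlock(nn.Module):
    """Aggregates m independent branches of `width` channels each."""

    def __init__(self, in_channels: int, width: int, m: int, num_classes: int):
        super().__init__()
        self.m = m
        # Each member gets its own stem filters over the shared input.
        self.stem = nn.Conv2d(in_channels, m * width, kernel_size=3, padding=1)
        # groups=m gives a block-diagonal weight structure: member i only
        # mixes its own `width` channels, keeping the branches independent.
        self.body = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(m * width, m * width, kernel_size=3, padding=1, groups=m),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Block-diagonal classifier head: one linear map per member.
        self.heads = nn.Conv2d(m * width, m * num_classes, kernel_size=1, groups=m)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.heads(self.body(self.stem(x)))       # (B, m * num_classes, 1, 1)
        z = z.view(x.size(0), self.m, -1)             # (B, m, num_classes)
        return z.mean(dim=1)                          # average the m members


if __name__ == "__main__":
    model = CollegialEnsembleBlock(in_channels=3, width=16, m=4, num_classes=10)
    logits = model(torch.randn(2, 3, 32, 32))
    print(logits.shape)  # torch.Size([2, 10])
```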
