Paper Title

Scaling Laws Beyond Backpropagation

Paper Authors

Filipovich, Matthew J., Cappelli, Alessandro, Hesslow, Daniel, Launay, Julien

Paper Abstract

Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment~(DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is never a regime for which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding comes at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.
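
For readers unfamiliar with Direct Feedback Alignment, the sketch below illustrates the core idea on a toy multilayer perceptron: the output error is projected to each hidden layer through fixed random feedback matrices instead of being backpropagated through the transposed forward weights. This is a minimal illustrative example under assumed choices (numpy, tanh hidden units, MSE loss, hypothetical names); it is not the causal decoder-only Transformer setup studied in the paper.

```python
import numpy as np

# Minimal sketch of Direct Feedback Alignment (DFA) on a toy MLP.
# Illustrative only: tanh hidden units, linear output, MSE loss.

rng = np.random.default_rng(0)

# Layer sizes: input -> hidden -> hidden -> output
sizes = [8, 16, 16, 4]
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
# Fixed random feedback matrices project the output error directly to each
# hidden layer (replacing the transposed forward weights used by backprop).
Bs = [rng.standard_normal((m, sizes[-1])) * 0.1 for m in sizes[1:-1]]

def tanh(x):
    return np.tanh(x)

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2

def dfa_step(x, y, lr=1e-2):
    # Forward pass, caching pre-activations and activations.
    h, pre, acts = x, [], [x]
    for W in Ws[:-1]:
        a = W @ h
        pre.append(a)
        h = tanh(a)
        acts.append(h)
    y_hat = Ws[-1] @ h          # linear output layer
    e = y_hat - y               # output error (gradient of MSE w.r.t. y_hat)

    # Output layer uses its true local gradient.
    Ws[-1] -= lr * np.outer(e, acts[-1])

    # Hidden layers: the output error is fed back through fixed random
    # matrices and modulated by the local activation derivative,
    # so no weight transport or sequential backward pass is needed.
    for i in range(len(Ws) - 2, -1, -1):
        delta = (Bs[i] @ e) * dtanh(pre[i])
        Ws[i] -= lr * np.outer(delta, acts[i])

    return 0.5 * float(e @ e)

# Toy usage: fit a single random input-output pair.
x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])
for step in range(200):
    loss = dfa_step(x, y)
print(f"final loss: {loss:.4f}")
```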
