Paper Title

Dive into Big Model Training

Paper Authors

Qinghua Liu, Yuxiang Jiang

Paper Abstract

The increasing scale of model size and continuous improvement of performance herald the arrival of the Big Model era. In this report, we explore what and how big model training works by diving into training objectives and training methodologies. Specifically, training objectives describe how to leverage web-scale data to develop extremely capable and incredibly large models based on self-supervised learning, and training methodologies, which are based on distributed training, describe how to make big model training a reality. We summarize the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design. Training parallelism can be categorized into data, pipeline, and tensor parallelism according to the dimension along which parallelism takes place. Memory-saving technologies are orthogonal and complementary to training parallelism. Model sparsity design further scales up the model size at a constant computational cost. A continuously updated paper list of big model training is provided at https://github.com/qhliu26/BM-Training.
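The report is a survey and the abstract does not come with code, but as a rough illustration of the data-parallelism category it mentions, below is a minimal PyTorch DistributedDataParallel sketch. The toy model, synthetic dataset, and hyperparameters are placeholders chosen here for illustration, not taken from the paper.

```python
# Minimal data-parallelism sketch using PyTorch DistributedDataParallel (DDP).
# The model, dataset, and hyperparameters are illustrative placeholders only.
# Launch with, e.g.: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a full replica of the model (the defining trait of data parallelism).
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Synthetic data; DistributedSampler shards it so each rank sees a disjoint slice.
    data = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()   # gradients are all-reduced (averaged) across ranks here
            optimizer.step()  # every replica applies the same averaged update

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Pipeline and tensor parallelism, the other two parallelism dimensions named in the abstract, instead split the model itself across devices (by layers or within layers) rather than replicating it, and are typically combined with data parallelism for the largest models.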
