Paper Title
Boosting Distributed Training Performance of the Unpadded BERT Model
Paper Authors
Paper Abstract
Pre-training models are an important tool in Natural Language Processing (NLP), and the BERT model is a classic pre-training model whose structure has been widely adopted by followers. It was even chosen as the reference model for the MLPerf training benchmark. Optimizing the distributed training performance of BERT models therefore plays an important role in accelerating the solutions of most NLP tasks. The BERT model often uses padded tensors as its inputs, leading to excessive redundant computation, so removing these redundant computations is essential for improving distributed training performance. This paper designs a new approach to train BERT models with variable-length inputs efficiently. First, we propose a general structure for variable-length BERT models and accelerate the encoder layer via our grouped multi-stream FMHA (Fused Multi-Head Attention) method. Second, we address the unbalanced workload problem caused by the variable-length inputs through data exchange, and this exchange is highly overlapped with the training process. Finally, we optimize the overall performance of the BERT model through techniques such as kernel fusion and operator optimization. Our experimental results show that our highly optimized BERT model achieves state-of-the-art throughput and ranks first in MLPerf Training v2.0 within the same GPU configuration. The optimizations in this paper can be applied to more BERT-like models in our future work.
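To make the unpadded input layout described in the abstract concrete, below is a minimal sketch (assuming PyTorch tensors; the example data and variable names are illustrative and not taken from the paper). A padded batch forces every sequence to the length of the longest one, while the packed layout keeps only the real tokens plus cumulative sequence lengths, the kind of boundary metadata a variable-length fused attention kernel consumes.

import torch

# Three sequences of different lengths in one toy mini-batch (illustrative data).
seqs = [torch.arange(n) for n in (5, 2, 7)]

# Padded layout: every sequence is padded to the longest one, so the
# encoder also spends compute on the padding positions.
max_len = max(s.numel() for s in seqs)
padded = torch.zeros(len(seqs), max_len, dtype=torch.long)
for i, s in enumerate(seqs):
    padded[i, : s.numel()] = s
print("tokens in padded batch:", padded.numel())   # 3 * 7 = 21

# Unpadded (packed) layout: concatenate only the real tokens and keep
# cumulative sequence lengths so sequence boundaries can be recovered.
lengths = torch.tensor([s.numel() for s in seqs])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
packed = torch.cat(seqs)
print("tokens in packed batch:", packed.numel())   # 5 + 2 + 7 = 14
print("cu_seqlens:", cu_seqlens.tolist())          # [0, 5, 7, 14]

In this toy batch, packing reduces the token count from 21 to 14; that gap is the redundant computation the paper's variable-length structure avoids.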