Paper Title
Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism
Paper Authors
Paper Abstract
In this paper, we consider hybrid parallelism -- a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP) -- to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication-efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication-efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as recommendation systems used in production at Facebook. DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively. The algorithm has been deployed in production, and it improves end-to-end training time for a state-of-the-art industrial recommender model by 37\%, without any loss in performance.
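To make the hard-thresholding idea concrete, below is a minimal PyTorch-style sketch of the DP-side gradient compression described in the abstract. The helper names (`estimate_threshold`, `hard_threshold_compress`), the 1% keep ratio, the 1000-iteration refresh interval, and the random stand-in gradients are illustrative assumptions, not details taken from the paper.

```python
import torch

def estimate_threshold(grad: torch.Tensor, keep_ratio: float = 0.01) -> float:
    """Pick a hard threshold so that roughly `keep_ratio` of the gradient
    entries survive (illustrative top-k/quantile estimate)."""
    k = max(1, int(keep_ratio * grad.numel()))
    flat = grad.abs().flatten()
    return flat.kthvalue(flat.numel() - k + 1).values.item()

def hard_threshold_compress(grad: torch.Tensor, tau: float):
    """Keep only entries with |g| >= tau; return indices, values, and shape."""
    mask = grad.abs() >= tau
    return mask.nonzero(as_tuple=False), grad[mask], grad.shape

# Illustrative loop: the threshold is refreshed only occasionally, while the
# cheap thresholding itself runs at every iteration before the (now sparse)
# gradient would be sent to the parameter server.
tau = None
for step in range(3000):
    grad = torch.randn(100_000)          # stand-in for a real parameter gradient
    if tau is None or step % 1000 == 0:  # "only once every few thousand iterations"
        tau = estimate_threshold(grad, keep_ratio=0.01)
    indices, values, shape = hard_threshold_compress(grad, tau)
    # indices/values/shape replace the dense gradient on the wire
```

Under these assumptions, only the surviving indices and values are communicated in place of the dense gradient, and refreshing the threshold infrequently keeps the compression overhead itself small.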