Paper Title

RCD-SGD: Resource-Constrained Distributed SGD in Heterogeneous Environment via Submodular Partitioning

Authors

Haoze He, Parijat Dube

Abstract

The convergence of SGD-based distributed training algorithms is tied to the data distribution across workers. Standard partitioning techniques try to achieve equal-sized partitions with per-class population distribution in proportion to the total dataset. Partitions having the same overall population size, or even the same number of samples per class, may still have a non-IID distribution in the feature space. In heterogeneous computing environments, when devices have different computing capabilities, even-sized partitions across devices can lead to the straggler problem in distributed SGD. We develop a framework for distributed SGD in heterogeneous environments based on a novel data partitioning algorithm involving submodular optimization. Our data partitioning algorithm explicitly accounts for resource heterogeneity across workers while achieving similar class-level feature distribution and maintaining class balance. Based on this algorithm, we develop a distributed SGD framework that can accelerate existing SOTA distributed training algorithms by up to 32%.
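To make the idea concrete, below is a minimal sketch of capacity-constrained greedy partitioning under a submodular objective, where each worker's block size is proportional to its speed. The abstract does not specify the paper's actual formulation, so everything here, including the function name `greedy_submodular_partition`, the RBF similarity, the facility-location objective, and the greedy assignment order, is an illustrative assumption rather than the authors' method.

```python
import numpy as np

def greedy_submodular_partition(features, speeds, seed=0):
    """Hypothetical sketch: split sample indices into len(speeds) blocks whose
    sizes are proportional to worker speeds, greedily maximizing a
    facility-location submodular gain per block. Not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    n, k = len(features), len(speeds)

    # Block capacity is proportional to worker speed (handles heterogeneity);
    # distribute the rounding remainder so capacities sum exactly to n.
    caps = np.floor(n * np.asarray(speeds) / np.sum(speeds)).astype(int)
    caps[: n - caps.sum()] += 1

    # Pairwise RBF similarity drives the facility-location gain (assumption).
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-sq / sq.mean())

    blocks = [[] for _ in range(k)]
    coverage = np.zeros((k, n))  # coverage[b, i]: how well block b covers sample i
    for i in rng.permutation(n):
        # Marginal facility-location gain of adding sample i to each block.
        gains = np.maximum(sim[i] - coverage, 0.0).sum(axis=1)
        # Disallow blocks that are already at capacity.
        gains[[len(b) >= c for b, c in zip(blocks, caps)]] = -np.inf
        b = int(np.argmax(gains))
        blocks[b].append(int(i))
        coverage[b] = np.maximum(coverage[b], sim[i])
    return blocks

# Toy usage: 3 workers with relative speeds 1:2:3 on 60 samples.
X = np.random.default_rng(1).normal(size=(60, 8))
parts = greedy_submodular_partition(X, speeds=[1, 2, 3])
print([len(p) for p in parts])  # -> [10, 20, 30]
```

Under this sketch, the submodular gain encourages each block to cover the feature space similarly (addressing the non-IID concern), while the speed-proportional capacities directly target the straggler problem the abstract describes.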
