Paper Title
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
Paper Authors
Paper Abstract
Deep learning (DL) has shown remarkable success in a wide variety of fields. Developing a DL model is a time-consuming and resource-intensive process, so dedicated GPU accelerators are commonly aggregated into GPU datacenters. Efficient scheduler design for such GPU datacenters is crucial to reducing operational costs and improving resource utilization. However, traditional approaches designed for big data or high-performance computing workloads cannot enable DL workloads to fully utilize GPU resources. Recently, a large number of schedulers have been proposed, tailored to DL workloads in GPU datacenters. This paper surveys existing research efforts on scheduling both training and inference workloads. We primarily present how existing schedulers facilitate their respective workloads in terms of scheduling objectives and resource consumption features. Finally, we highlight several promising future research directions. A more detailed summary, with links to the surveyed papers and code, is available at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers