Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Paper title

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Authors

Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing

Abstract

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize the provided resources. Pollux simultaneously considers both aspects. By monitoring the status of each job during training, Pollux models how each job's goodput (a novel metric we introduce that combines system throughput with statistical efficiency) would change by adding or removing resources. Leveraging this information, Pollux dynamically (re-)assigns resources to improve cluster-wide goodput, while respecting fairness and continually optimizing each DL job to better utilize those resources. In experiments with real DL jobs and with trace-driven simulations, Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers, even when they are provided with ideal resource and training configurations for every job. Pollux promotes fairness among DL jobs competing for resources based on a more meaningful measure of useful job progress, and reveals a new opportunity for reducing DL cost in cloud environments. Pollux is implemented and publicly available as part of an open-source project at https://github.com/petuum/adaptdl.
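
To make the goodput idea in the abstract concrete, the following is a minimal sketch, not the AdaptDL implementation. It assumes goodput is the product of throughput and statistical efficiency, and it assumes an efficiency term of the form (noise_scale + base_batch_size) / (noise_scale + batch_size). The throughput model, its constants, and all function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: all names, constants, and the throughput model
# below are assumptions, not taken from the abstract or from AdaptDL.

def efficiency(batch_size, base_batch_size, noise_scale):
    """Assumed statistical efficiency relative to the base batch size.

    Equals 1.0 at the base batch size and shrinks as the batch grows.
    """
    return (noise_scale + base_batch_size) / (noise_scale + batch_size)


def throughput(num_gpus, batch_size, compute_per_example=0.001, sync_time=0.05):
    """Toy throughput model in examples per second (hypothetical constants).

    Each iteration costs per-GPU compute time plus a fixed synchronization
    cost when more than one GPU is used.
    """
    per_gpu_examples = batch_size / num_gpus
    sync = sync_time if num_gpus > 1 else 0.0
    iteration_time = per_gpu_examples * compute_per_example + sync
    return batch_size / iteration_time


def goodput(num_gpus, batch_size, base_batch_size, noise_scale):
    """Goodput as throughput weighted by statistical efficiency."""
    return throughput(num_gpus, batch_size) * efficiency(
        batch_size, base_batch_size, noise_scale
    )


def best_batch_size(num_gpus, candidates, base_batch_size, noise_scale):
    """Pick the candidate batch size with the highest goodput."""
    return max(
        candidates,
        key=lambda m: goodput(num_gpus, m, base_batch_size, noise_scale),
    )


if __name__ == "__main__":
    candidates = [128, 256, 512, 1024]
    for gpus in (1, 4):
        chosen = best_batch_size(gpus, candidates, base_batch_size=128, noise_scale=1000)
        print(gpus, "GPU(s): batch size", chosen)
```

Under these toy assumptions, a single GPU gains nothing from a larger batch, while four GPUs favor a larger batch that amortizes synchronization cost until the efficiency loss outweighs the throughput gain. This is the kind of coupling between resource allocation and training configuration that the abstract describes as co-adaptation.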
