Paper Title
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
Paper Authors
Paper Abstract
Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose Zeus, an optimization framework to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%-75.8% for diverse workloads.
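The online exploration-exploitation idea for recurring jobs can be illustrated with a small sketch. The abstract does not specify the cost function or the search strategy, so everything below is an assumption made for illustration: the weighted energy/time cost, the epsilon-greedy selection rule, the candidate power limits, and the toy job model in `simulate_job` are all hypothetical and are not taken from the paper.

```python
import random


def cost(energy_j, time_s, eta, max_power_w):
    """Assumed cost: eta weights energy (J) against time, with time scaled to joules."""
    return eta * energy_j + (1.0 - eta) * max_power_w * time_s


def pick_config(history, configs, epsilon, rng):
    """Epsilon-greedy: try each config once, then mostly exploit the lowest mean cost."""
    untried = [c for c in configs if not history.get(c)]
    if untried:
        return untried[0]
    if rng.random() < epsilon:
        return rng.choice(configs)  # explore
    return min(configs, key=lambda c: sum(history[c]) / len(history[c]))  # exploit


def simulate_job(power_limit_w):
    """Hypothetical job model: a lower power limit runs slower but uses less energy."""
    time_s = 1000.0 * (300.0 / power_limit_w) ** 0.5
    energy_j = power_limit_w * time_s
    return energy_j, time_s


def run(num_jobs=50, eta=0.6, epsilon=0.1, seed=0):
    """Run a sequence of recurring jobs, learning the cheapest power limit online."""
    rng = random.Random(seed)
    configs = [150, 200, 250, 300]  # hypothetical GPU power limits in watts
    history = {}
    for _ in range(num_jobs):
        c = pick_config(history, configs, epsilon, rng)
        energy_j, time_s = simulate_job(c)
        history.setdefault(c, []).append(cost(energy_j, time_s, eta, max(configs)))
    best = min(history, key=lambda c: sum(history[c]) / len(history[c]))
    return best, history


if __name__ == "__main__":
    for eta in (0.0, 0.6, 1.0):
        print(eta, run(eta=eta)[0])
```

In this toy model, setting `eta` to 1.0 optimizes for energy alone and favors the lowest power limit, setting it to 0.0 optimizes for time alone and favors the highest, and intermediate values can land on a configuration in between. Each job's measurement feeds the next decision, so no offline profiling pass is needed.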