Paper Title


Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge

Authors

Anirban Bhattacharjee, Ajay Dev Chhokra, Hongyang Sun, Shashank Shekhar, Aniruddha Gokhale, Gabor Karsai, Abhishek Dubey

Abstract


Deep Learning (DL) model-based AI services are increasingly offered in a variety of predictive analytics services such as computer vision, natural language processing, and speech recognition. However, the quality of DL models can degrade over time due to changes in the input data distribution, thereby requiring periodic model updates. Although cloud data centers can meet the computational requirements of the resource-intensive and time-consuming model update task, transferring data from the edge devices to the cloud incurs a significant cost in terms of network bandwidth and is prone to data privacy issues. With the advent of GPU-enabled edge devices, the DL model update can be performed at the edge in a distributed manner using multiple connected edge devices. However, efficiently utilizing the edge resources for the model update is a hard problem due to the heterogeneity among the edge devices and the resource interference caused by the co-location of the DL model update task with latency-critical tasks running in the background. To overcome these challenges, we present Deep-Edge, a load- and interference-aware, fault-tolerant resource management framework for performing model updates at the edge using distributed training. This paper makes the following contributions. First, it provides a unified framework for monitoring, profiling, and deploying DL model update tasks on heterogeneous edge devices. Second, it presents a scheduler that reduces the total re-training time by appropriately selecting the edge devices and distributing data among them such that no latency-critical applications experience deadline violations. Finally, we present empirical results to validate the efficacy of the framework using a real-world DL model update case study based on the Caltech dataset and an edge AI cluster testbed.
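The abstract does not spell out the scheduler's algorithm, but its goal — splitting the re-training data across heterogeneous devices so that the slowest device finishes as early as possible while interference from background latency-critical tasks is accounted for — can be illustrated with a toy heuristic. The sketch below is a minimal assumption of our own: device names, throughput numbers, the `interference` discount factor, and the proportional-split rule are all illustrative, not Deep-Edge's actual method.

```python
# Toy sketch (not the Deep-Edge algorithm): partition training samples
# across heterogeneous edge devices proportional to each device's
# effective throughput, where throughput is discounted by a measured
# interference factor from co-located latency-critical tasks.

def partition_samples(total_samples, devices):
    """Split samples proportional to effective throughput (samples/sec
    after the interference discount); the rounding remainder goes to
    the fastest device."""
    rates = {d["name"]: d["throughput"] * (1.0 - d["interference"])
             for d in devices}
    total_rate = sum(rates.values())
    assignment = {name: int(total_samples * r / total_rate)
                  for name, r in rates.items()}
    fastest = max(rates, key=rates.get)
    assignment[fastest] += total_samples - sum(assignment.values())
    return assignment, rates

def makespan(assignment, rates):
    """Estimated re-training time = finishing time of the slowest device."""
    return max(assignment[name] / rates[name] for name in assignment)

# Hypothetical cluster: three devices with different raw throughputs
# and different levels of background interference.
devices = [
    {"name": "nano",   "throughput": 50.0,  "interference": 0.2},
    {"name": "tx2",    "throughput": 100.0, "interference": 0.0},
    {"name": "xavier", "throughput": 200.0, "interference": 0.5},
]

assignment, rates = partition_samples(2400, devices)
print(assignment)                      # proportional split of 2400 samples
print(makespan(assignment, rates))     # estimated re-training time (sec)
```

A proportional split equalizes per-device finishing times, which minimizes the makespan for this simplified model; the real scheduler additionally has to respect deadline constraints of the co-located latency-critical applications, which this sketch omits.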
