Paper Title
Split Learning over Wireless Networks: Parallel Design and Resource Management
Paper Authors
Paper Abstract
Split learning (SL) is a collaborative learning framework that trains an artificial intelligence (AI) model between a device and an edge server by splitting the AI model into a device-side model and a server-side model at a cut layer. The existing SL approach conducts the training process sequentially across devices, which incurs significant training latency, especially when the number of devices is large. In this paper, we design a novel SL scheme to reduce the training latency, named Cluster-based Parallel SL (CPSL), which conducts model training in a "first-parallel-then-sequential" manner. Specifically, CPSL partitions devices into several clusters, trains the device-side models within each cluster in parallel and aggregates them, and then trains the whole AI model sequentially across clusters, thereby parallelizing the training process and reducing the training latency. Furthermore, we propose a resource management algorithm to minimize the training latency of CPSL, taking into account device heterogeneity and network dynamics in wireless networks. This is achieved by stochastically optimizing the cut layer selection, real-time device clustering, and radio spectrum allocation. The proposed two-timescale algorithm jointly makes the cut layer selection decision on a large timescale and the device clustering and radio spectrum allocation decisions on a small timescale. Extensive simulation results on non-independent and identically distributed data demonstrate that the proposed solution can greatly reduce the training latency compared with existing SL benchmarks, while adapting to network dynamics.
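To make the "first-parallel-then-sequential" workflow concrete, below is a minimal sketch in PyTorch. It is not the authors' implementation: the function and variable names (`train_cpsl`, `average_state_dicts`, the cluster structure) are illustrative assumptions, the per-device loop stands in for genuinely parallel training within a cluster, and the server side is simplified to a single shared model updated as each device's smashed data arrives. It only illustrates the split at the cut layer, the intra-cluster aggregation of device-side models, and the sequential pass across clusters.

```python
# Hypothetical CPSL sketch (PyTorch assumed; names are illustrative, not from the paper's code).
import copy
import torch

def average_state_dicts(state_dicts):
    """FedAvg-style aggregation of device-side model parameters within a cluster."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def train_cpsl(device_model, server_model, clusters, loss_fn, epochs=1, lr=0.01):
    """device_model: layers up to the cut layer; server_model: the remaining layers.
    clusters: list of clusters, each a list of per-device dataloaders yielding (x, y)."""
    for _ in range(epochs):
        for cluster in clusters:                      # sequential across clusters
            local_states = []
            for loader in cluster:                    # parallel across devices (sequential in this sketch)
                local_model = copy.deepcopy(device_model)
                opt_d = torch.optim.SGD(local_model.parameters(), lr=lr)
                opt_s = torch.optim.SGD(server_model.parameters(), lr=lr)
                for x, y in loader:
                    smashed = local_model(x)          # device-side forward up to the cut layer
                    out = server_model(smashed)       # server-side forward from the cut layer
                    loss = loss_fn(out, y)
                    opt_d.zero_grad()
                    opt_s.zero_grad()
                    loss.backward()                   # gradients flow back across the cut layer
                    opt_d.step()
                    opt_s.step()
                local_states.append(local_model.state_dict())
            # aggregate device-side models within the cluster before moving to the next one
            device_model.load_state_dict(average_state_dicts(local_states))
    return device_model, server_model
```

In this sketch the cut layer is fixed by however `device_model` and `server_model` are split; in the paper's scheme that split point, along with device clustering and spectrum allocation, is chosen by the two-timescale resource management algorithm rather than hard-coded.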