Paper title
Design of a Near-Ideal Fault-Tolerant Routing Algorithm for Network-on-Chip-Based Multicores
Paper authors
Paper abstract
With relentless CMOS technology downsizing, Networks-on-Chip (NoCs) are inescapably becoming more susceptible to wearout and less reliable. While faults in processors and memories may be masked via redundancy, or mitigated via techniques such as task migration, NoCs are especially vulnerable to hardware faults: a single link breakdown may halt inter-tile communication indefinitely, rendering the whole multicore chip inoperable. As such, NoCs risk becoming the pivotal point of failure in the chip multicores that use them. Aiming at seamless NoC operation in the presence of faulty links, we propose Hermes, a near-ideal fault-tolerant routing algorithm that meets the objectives of exhibiting high robustness, operating in a distributed mode, guaranteeing freedom from deadlock, and evening out traffic, among others. Hermes is a limited-overhead, deadlock-free hybrid routing algorithm that uses load-balancing routing on fault-free paths to sustain high throughput, while providing pre-reconfigured escape path selection in the vicinity of faults. Under such online mechanisms, Hermes's performance degrades gracefully as the number of faulty links increases, a crucially desirable response lacking in prior art. Additionally, Hermes identifies non-communicating network partitions in scenarios where faulty links are topologically densely distributed, so that packets routed towards physically isolated regions do not stall the network through indefinite chained blockages starting at sub-network boundaries. An extensive experimental evaluation, including traffic workloads gathered from full-system chip multiprocessor simulations, shows that Hermes improves network throughput by up to $3\times$ compared with the state of the art. Further, hardware synthesis results demonstrate Hermes's efficacy.
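The partition-identification idea can be illustrated with a small sketch. The abstract does not describe how Hermes detects partitions, and it states that Hermes operates in a distributed mode, so the code below is not the authors' mechanism. It is a centralized breadth-first labelling of a 2D mesh, given only to show what "non-communicating partitions" means. The mesh topology, the function names `mesh_partitions` and `reachable`, and the fault-list format are all assumptions made for illustration.

```python
from collections import deque


def mesh_partitions(width, height, faulty_links):
    """Label every node of a width x height 2D mesh with a partition id.

    Hypothetical helper, not taken from the paper. `faulty_links` is an
    iterable of ((x1, y1), (x2, y2)) pairs naming bidirectional links
    that are treated as broken.
    """
    # Store each broken link as an unordered pair so direction is irrelevant.
    broken = {frozenset(link) for link in faulty_links}
    label = {}
    next_id = 0
    for x in range(width):
        for y in range(height):
            if (x, y) in label:
                continue
            # Breadth-first search over healthy links from an unlabelled node.
            label[(x, y)] = next_id
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                neighbours = ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1))
                for nx, ny in neighbours:
                    if not (0 <= nx < width and 0 <= ny < height):
                        continue  # outside the mesh
                    if frozenset(((cx, cy), (nx, ny))) in broken:
                        continue  # link is faulty
                    if (nx, ny) not in label:
                        label[(nx, ny)] = next_id
                        queue.append((nx, ny))
            next_id += 1
    return label


def reachable(label, src, dst):
    """Two nodes can communicate only if they share a partition id."""
    return label[src] == label[dst]
```

In a sketch like this, a packet whose destination lies in a different partition from its source could be dropped at injection rather than left to block buffers at the partition boundary, which is the stagnation scenario the abstract describes.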