Paper Title
Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware
Paper Authors
Paper Abstract
Here, we test the performance and scalability of fully asynchronous, best-effort communication on existing, commercially available HPC hardware. A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window. Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To answer these questions, we designed and measured a suite of quality of service metrics: simulation update period, message latency, message delivery failure rate, and message delivery coagulation. Under a lower communication-intensivity benchmark parameterization, we found that median values for all quality of service metrics were stable when scaling from 64 to 256 processes. Under maximal communication intensivity, we found only minor (and, in most cases, nil) degradation in median quality of service. In an additional set of experiments, we tested the effect of an apparently faulty compute node on performance and quality of service. Despite extreme quality of service degradation among that node and its clique, median performance and quality of service remained stable.
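To make two of the quality of service metrics above concrete, here is a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: that a best-effort send is a non-blocking write to a bounded buffer which drops the message when the buffer is full, that delivery failure rate is the number of dropped sends divided by the number of attempted sends, and that simulation update period is the elapsed time between consecutive updates of one process. The names `BestEffortChannel`, `try_send`, `receive_all`, and `median_update_period` are hypothetical.

```python
from collections import deque
from statistics import median


class BestEffortChannel:
    """Hypothetical bounded, non-blocking channel (an assumption, not the
    paper's code): a send to a full buffer is dropped instead of waiting."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.attempted = 0
        self.dropped = 0

    def try_send(self, message):
        """Attempt a send; return False and drop the message if the buffer is full."""
        self.attempted += 1
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return False
        self.buffer.append(message)
        return True

    def receive_all(self):
        """Drain whatever has arrived so far; never wait for more."""
        received = list(self.buffer)
        self.buffer.clear()
        return received

    def failure_rate(self):
        """Assumed definition: dropped sends divided by attempted sends."""
        if self.attempted == 0:
            return 0.0
        return self.dropped / self.attempted


def median_update_period(update_timestamps):
    """Assumed definition: median gap between consecutive update timestamps."""
    gaps = [later - earlier
            for earlier, later in zip(update_timestamps, update_timestamps[1:])]
    return median(gaps)


if __name__ == "__main__":
    # The sender produces 3 messages per step; the buffer holds only 2,
    # so one message per step is dropped rather than stalling the sender.
    channel = BestEffortChannel(capacity=2)
    for step in range(4):
        for index in range(3):
            channel.try_send((step, index))
        channel.receive_all()
    print("delivery failure rate:", channel.failure_rate())
    print("median update period:", median_update_period([0.0, 1.0, 2.5, 3.5]))
```

The sketch shows the trade-off the abstract describes: the sender never blocks, so its update rate does not depend on the receiver, and the cost appears as a measurable delivery failure rate. Message latency and delivery coagulation are left out because the abstract does not define them.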