Paper Title

ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing

Authors

Cheng Tan, Chenhao Xie, Tong Geng, Andres Marquez, Antonino Tumeo, Kevin Barker, Ang Li

Abstract

The next generation of HPC systems and data centers is likely to be reconfigurable and data-centric, driven by the trend toward hardware specialization and the emergence of data-driven applications. In this paper, we propose ARENA, an asynchronous reconfigurable accelerator ring architecture, as one potential vision of what future HPC systems and data centers will look like. Although ARENA uses coarse-grained reconfigurable arrays (CGRAs) as the substrate platform, our key contribution is not only the CGRA-cluster design itself but also the combination of a new architecture and programming model that enables asynchronous tasking across a cluster of reconfigurable nodes, so as to bring specialized computation to the data rather than the reverse. We presume distributed data storage without assuming any prior knowledge of the data distribution. Hardware specialization occurs at runtime, when a task finds that the majority of the data it requires is available at the present node. In other words, we dynamically generate specialized CGRA accelerators where the data reside. The asynchronous tasking that brings computation to data is achieved by circulating a task token, which describes the data-flow graphs to be executed for a task, among the CGRA cluster connected by a fast ring network. Evaluations on a set of HPC and data-driven applications across different domains show that ARENA can provide better parallel scalability with reduced data movement (53.9%). Compared with contemporary compute-centric parallel models, ARENA achieves a 4.37x speedup on average. The synthesized CGRAs and their task dispatchers occupy only 2.93 mm^2 of chip area under 45 nm process technology and can run at 800 MHz with an average power consumption of 759.8 mW. ARENA also supports the concurrent execution of multiple applications, offering ideal architectural support for future high-performance parallel computing and data analytics systems.
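The token-circulation idea in the abstract can be illustrated with a small software sketch. This is not the ARENA hardware or its programming interface: the names `TaskToken`, `Node`, `circulate`, the contiguous data layout, and the `threshold` parameter are all assumptions introduced here for illustration. A token visits ring nodes in order and is executed at the first node that holds more than half of the data it needs.

```python
from dataclasses import dataclass, field


@dataclass
class TaskToken:
    """Hypothetical token: identifies a task and the data range it needs."""
    task_id: int
    start: int      # first required data index (inclusive)
    end: int        # last required data index (exclusive)
    hops: int = 0   # number of ring hops taken so far


@dataclass
class Node:
    """Hypothetical ring node that owns the data indices [lo, hi)."""
    node_id: int
    lo: int
    hi: int
    executed: list = field(default_factory=list)

    def local_fraction(self, token: TaskToken) -> float:
        """Fraction of the token's data range stored on this node."""
        overlap = max(0, min(self.hi, token.end) - max(self.lo, token.start))
        return overlap / (token.end - token.start)


def circulate(nodes, token, start=0, threshold=0.5):
    """Pass the token around the ring once, beginning at index `start`.

    The token runs at the first node holding more than `threshold` of its
    data. Returns that node's id, or None if no node qualifies (assumed
    behaviour; the abstract does not say how that case is handled).
    """
    for step in range(len(nodes)):
        node = nodes[(start + step) % len(nodes)]
        if node.local_fraction(token) > threshold:
            node.executed.append(token.task_id)  # computation runs where the data is
            return node.node_id
        token.hops += 1  # forward the token to the next node on the ring
    return None


# Usage: four nodes, each owning 100 consecutive data elements.
ring = [Node(i, i * 100, (i + 1) * 100) for i in range(4)]
token = TaskToken(task_id=7, start=120, end=200)
owner = circulate(ring, token)
```

In this sketch only the small token moves between nodes, while the data stays in place, which is the data-centric behaviour the abstract describes.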
