AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular Workloads

Paper Title

AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular Workloads

Authors

Zhang, Chi; Scheffler, Paul; Benz, Thomas; Perotti, Matteo; Benini, Luca

Abstract

Data-intensive applications involving irregular memory streams are inefficiently handled by modern processors and memory systems highly optimized for regular, contiguous data. Recent work tackles these inefficiencies in hardware through core-side stream extensions or memory-side prefetchers and accelerators, but fails to provide end-to-end solutions which also achieve high efficiency in on-chip interconnects. We propose AXI-Pack, an extension to ARM's AXI4 protocol introducing bandwidth-efficient strided and indirect bursts to enable end-to-end irregular streams. AXI-Pack adds irregular stream semantics to memory requests and avoids inefficient narrow-bus transfers by packing multiple narrow data elements onto a wide bus. It retains full compatibility with AXI4 and does not require modifications to non-burst-reshaping interconnect IPs. To demonstrate our approach end-to-end, we extend an open-source RISC-V vector processor to leverage AXI-Pack at its memory interface for strided and indexed accesses. On the memory side, we design a banked memory controller efficiently handling AXI-Pack requests. On a system with a 256-bit-wide interconnect running FP32 workloads, AXI-Pack achieves near-ideal peak on-chip bus utilizations of 87% and 39%, speedups of 5.4x and 2.4x, and energy efficiency improvements of 5.3x and 2.1x over a baseline using an AXI4 bus on strided and indirect benchmarks, respectively.
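
The packing idea in the abstract can be illustrated with a small, idealized model. This is only a sketch under stated assumptions, not the authors' implementation: it assumes the unpacked case moves one 32-bit element per 256-bit bus beat, models memory as a plain Python list, and ignores request, index, and bank-conflict overheads, so it does not reproduce the utilization figures reported above. All function names (`pack_strided`, `pack_indirect`, `utilization`) are hypothetical.

```python
import math

BUS_BITS = 256   # interconnect width quoted in the abstract
ELEM_BITS = 32   # FP32 element width quoted in the abstract


def pack(elements, bus_bits=BUS_BITS, elem_bits=ELEM_BITS):
    """Group narrow elements into bus beats, filling each beat before starting the next."""
    per_beat = bus_bits // elem_bits
    return [elements[i:i + per_beat] for i in range(0, len(elements), per_beat)]


def pack_strided(memory, base, stride, count):
    """Hypothetical strided burst: gather `count` elements spaced `stride` apart, then pack."""
    return pack([memory[base + i * stride] for i in range(count)])


def pack_indirect(memory, base, indices):
    """Hypothetical indirect burst: gather elements at base + index for each index, then pack."""
    return pack([memory[base + idx] for idx in indices])


def utilization(num_elems, num_beats, bus_bits=BUS_BITS, elem_bits=ELEM_BITS):
    """Fraction of transferred bus bits that carry useful data."""
    return (num_elems * elem_bits) / (num_beats * bus_bits)


if __name__ == "__main__":
    memory = list(range(1000))          # toy memory, one element per entry
    n = 64
    beats = pack_strided(memory, base=0, stride=3, count=n)
    print("packed beats:", len(beats))                              # 64 elements, 8 per beat
    print("packed utilization:", utilization(n, len(beats)))
    print("one-element-per-beat utilization:", utilization(n, n))   # assumed unpacked case
```

In this toy model, 64 strided FP32 elements occupy 8 fully used beats when packed, versus 64 beats that each carry 32 useful bits out of 256 in the assumed one-element-per-beat case.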
