Helper Without Threads: Customized Prefetching for Delinquent Irregular Loads

Paper Title

Helper Without Threads: Customized Prefetching for Delinquent Irregular Loads

Authors

Karthik Sankaranarayanan, Chit-Kwan Lin, Gautham Chinya

Abstract

The growing memory footprints of cloud and big data applications mean that data center CPUs can spend significant time waiting for memory. An attractive approach to improving performance in such centralized compute settings is to employ prefetchers that are customized per application, where gains can be easily scaled across thousands of machines. Helper thread prefetching is such a technique but has yet to achieve wide adoption since it requires spare thread contexts or special hardware/firmware support. In this paper, we propose an inline software prefetching technique that overcomes these restrictions by inserting the helper code into the main thread itself. Our approach is complementary to and does not interfere with existing hardware prefetchers since we target only delinquent irregular load instructions (those with no constant or striding address patterns). For each chosen load instruction, we generate and insert a customized software prefetcher extracted from and mimicking the application's dataflow, all without access to the application source code. For a set of irregular workloads that are memory-bound, we demonstrate up to 2X single-thread performance improvement on recent high-end hardware (Intel Skylake) and up to 83% speedup over a helper thread implementation on the same hardware, due to the absence of thread spawning overhead.
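
To make the idea of inline prefetching for an irregular load concrete, here is a minimal source-level sketch in C. It is an illustration only, not the authors' implementation: the paper describes working without access to application source code, whereas this sketch places the prefetch by hand. The function names, the `PREFETCH_DISTANCE` value, and the indirect-access loop are all assumptions chosen for the example; `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in loop iterations; a real tool would tune it. */
#define PREFETCH_DISTANCE 8

/* Baseline: data[idx[i]] is an irregular (indirect) load. Its address depends
 * on the value of idx[i], so it has no constant or striding pattern. */
uint64_t sum_indirect(const uint64_t *data, const uint32_t *idx, size_t n) {
    assert(n == 0 || (data != NULL && idx != NULL));
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[idx[i]];
    }
    return sum;
}

/* Inline variant: the address computation of a future iteration is replayed
 * in the main thread and ends in a prefetch instead of a load. */
uint64_t sum_indirect_prefetch(const uint64_t *data, const uint32_t *idx, size_t n) {
    assert(n == 0 || (data != NULL && idx != NULL));
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* idx[] is read sequentially, so looking ahead in it is cheap;
             * arguments: address, 0 = read access, 1 = low temporal locality. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

Both functions return the same sum; only the timing of cache misses can differ. Whether the prefetching variant is actually faster depends on the working-set size, the prefetch distance, and the hardware, so any gain should be measured rather than assumed.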
