Paper Title
Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction
Paper Authors
Paper Abstract
Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.
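To make the prediction step concrete, the following is a minimal sketch of a hashed-perceptron-style off-chip load predictor. It is not the authors' implementation: the table size, weight range, both thresholds, the feature set in `extract_features`, and all class and function names are assumptions chosen for illustration. The abstract only states that the predictor is perceptron-based and uses multiple program features such as a sequence of program counters.

```python
# Illustrative sketch only; every constant and name below is an assumption.
TABLE_SIZE = 1024                  # entries per feature table (assumed)
WEIGHT_MIN, WEIGHT_MAX = -16, 15   # saturating weight range (assumed)
ACTIVATION_THRESHOLD = 2           # predict off-chip when the sum reaches this (assumed)
TRAINING_THRESHOLD = 8             # keep training while |sum| is below this (assumed)


def extract_features(pc, address, last_pcs):
    """Build hypothetical program features for one load request."""
    cacheline_offset = (address >> 6) & 0x3F  # 64-byte line index within a 4 KiB page
    pc_history = 0
    for old_pc in last_pcs:                   # fold recent load PCs into one value
        pc_history = (pc_history << 1) ^ old_pc
    return [pc, pc ^ cacheline_offset, pc_history]


class OffChipPredictor:
    """One weight table per feature; the prediction is the sign of the weight sum."""

    def __init__(self, num_features=3):
        self.tables = [[0] * TABLE_SIZE for _ in range(num_features)]

    def _indices(self, features):
        # Simple modulo indexing stands in for a hardware hash function.
        return [f % TABLE_SIZE for f in features]

    def _sum(self, features):
        idx = self._indices(features)
        return sum(table[i] for table, i in zip(self.tables, idx))

    def predict(self, features):
        """Return True if the load is predicted to go off-chip."""
        return self._sum(features) >= ACTIVATION_THRESHOLD

    def train(self, features, went_off_chip):
        """Update weights once the true outcome of the load is known."""
        total = self._sum(features)
        predicted = total >= ACTIVATION_THRESHOLD
        # Train on a misprediction, or while confidence is still low.
        if predicted != went_off_chip or abs(total) < TRAINING_THRESHOLD:
            step = 1 if went_off_chip else -1
            for table, i in zip(self.tables, self._indices(features)):
                table[i] = max(WEIGHT_MIN, min(WEIGHT_MAX, table[i] + step))


if __name__ == "__main__":
    predictor = OffChipPredictor()
    missing = extract_features(0x400100, 0x7F001040, [0x4000F0, 0x4000F8])
    hitting = extract_features(0x400200, 0x7F002080, [0x4001F0, 0x4001F8])
    for _ in range(10):
        predictor.train(missing, went_off_chip=True)
        predictor.train(hitting, went_off_chip=False)
    print(predictor.predict(missing), predictor.predict(hitting))
```

In the scheme the abstract describes, a positive prediction would trigger a speculative request to the memory controller as soon as the load's physical address is available, while the normal cache hierarchy lookup proceeds in parallel. The sketch covers only the prediction and training logic, not the speculative request path.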