Paper Title

From RDMA to RDCA: Toward High-Speed Last Mile of Data Center Networks Using Remote Direct Cache Access

Authors

Qiang Li, Qiao Xiang, Derui Liu, Yuxin Wang, Haonan Qiu, Xiaoliang Wang, Jie Zhang, Ridi Wen, Haohao Song, Gexiao Tian, Chenyang Huang, Lulu Chen, Shaozong Liu, Yaohui Wu, Zhiwu Wu, Zicheng Luo, Yuchao Shao, Chao Han, Zhongjie Wu, Jianbo Dong, Zheng Cao, Jinbo Wu, Jiwu Shu, Jiesheng Wu

Abstract


In this paper, we conduct systematic measurement studies to show that the high memory bandwidth consumption of modern distributed applications can lead to a significant drop in network throughput and a large increase in tail latency in high-speed RDMA networks. We identify the root cause as the high contention for memory bandwidth between application processes and network processes. This contention leads to frequent packet drops at the NIC of receiving hosts, which triggers the congestion control mechanism of the network and eventually results in network performance degradation. To tackle this problem, we make a key observation: in a distributed storage service, the vast majority of data received from the network is eventually written by the CPU to high-speed storage media (e.g., SSD). As such, we propose to bypass host memory when processing received data, completely circumventing this performance bottleneck. In particular, we design Lamda, a novel receiver cache processing system that consumes a small amount of CPU cache to process data received from the network at line rate. We implement a prototype of Lamda and evaluate its performance extensively on a Clos-based testbed. Results show that for distributed storage applications, Lamda improves network throughput by 4.7% with zero memory bandwidth consumption on storage nodes, and improves network throughput by up to 17% and 45% for large and small block sizes, respectively, under memory bandwidth pressure. Lamda can also be applied to latency-sensitive HPC applications, reducing their communication latency by 35.1%.
