Paper Title
A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models
Paper Authors
Paper Abstract
Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, and search. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such a massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework that combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks. Among other things, HPS features (1) a redundant hierarchical storage system, (2) a novel high-bandwidth cache to accelerate parallel embedding lookup on NVIDIA GPUs, (3) online training support, and (4) lightweight APIs for easy integration into existing large-scale recommendation workflows. To demonstrate its capabilities, we conduct extensive studies using both synthetically engineered and public datasets. We show that our HPS can dramatically reduce end-to-end inference latency, achieving a 5x to 62x speedup (depending on the batch size) over CPU baseline implementations for popular recommendation models. Through multi-GPU concurrent deployment, the HPS can also greatly increase the inference QPS.
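The core idea behind a hierarchical parameter server is that a small, fast cache of hot embeddings sits in front of much larger but slower storage tiers, and lookups fall through the tiers on a miss. The sketch below is a minimal, hypothetical illustration of that tiered lookup pattern in plain Python; it is not the HugeCTR HPS API. The class name `TieredEmbeddingStore`, the in-memory dictionaries standing in for the GPU cache and the backing store, and the LRU eviction policy are all assumptions made for illustration, whereas the real system uses CUDA-resident caches, multiple storage backends, and asynchronous update paths.

```python
# Conceptual sketch (not the HugeCTR HPS API): a two-tier embedding store with a
# small, fast cache standing in for the GPU embedding cache, backed by a larger,
# slower store standing in for the CPU-memory / SSD / database tiers.
from collections import OrderedDict
import numpy as np

class TieredEmbeddingStore:
    def __init__(self, dim: int, cache_capacity: int):
        self.dim = dim
        self.cache_capacity = cache_capacity
        self.cache = OrderedDict()   # stand-in for the GPU embedding cache (LRU order)
        self.backing_store = {}      # stand-in for the slower hierarchical storage tiers

    def _fetch_from_backing(self, key: int) -> np.ndarray:
        # Cache-miss path: retrieve (or lazily initialize) the embedding from the slow tier.
        if key not in self.backing_store:
            self.backing_store[key] = np.random.randn(self.dim).astype(np.float32)
        return self.backing_store[key]

    def lookup(self, keys) -> np.ndarray:
        # Batched lookup: hit the cache first, fall back to the backing store on a miss,
        # and insert the fetched vector into the cache, evicting the coldest entry if full.
        out = np.empty((len(keys), self.dim), dtype=np.float32)
        for i, key in enumerate(keys):
            if key in self.cache:
                self.cache.move_to_end(key)           # refresh LRU position on a hit
                out[i] = self.cache[key]
            else:
                vec = self._fetch_from_backing(key)
                if len(self.cache) >= self.cache_capacity:
                    self.cache.popitem(last=False)    # evict the least recently used entry
                self.cache[key] = vec
                out[i] = vec
        return out

# Usage: look up embeddings for a batch of categorical feature IDs.
store = TieredEmbeddingStore(dim=16, cache_capacity=1024)
batch = store.lookup([7, 42, 7, 100003])
print(batch.shape)  # (4, 16)
```

In the actual system, the same hit/miss logic is executed in parallel on the GPU for large batches of keys, which is what makes the cache tier effective at hiding the latency of the lower storage levels.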