Paper Title

TurboTransformers: An Efficient GPU Serving System For Transformer Models

Authors

Jiarui Fang, Yang Yu, Chengduo Zhao, Jie Zhou

Abstract

The Transformer is the most critical algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, Transformers can process the sequence-length dimension in parallel, leading to better accuracy on long sequences. However, efficiently deploying them for online services in GPU-equipped data centers is not easy. First, the additional computation introduced by Transformer structures makes it more challenging to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length, and this variability of input dimensions brings severe problems for efficient memory management and serving optimization. This paper presents a Transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework that together address the above challenges. Three innovative features make it stand out from similar works. First, an efficient parallel algorithm is proposed for GPU-based batch reduction operations, such as Softmax and LayerNorm, which are the major hot spots besides BLAS routines. Second, a memory allocation algorithm that better balances memory footprint against allocation/free efficiency is designed for variable-length inputs. Third, a serving framework equipped with a new batch scheduler based on dynamic programming achieves optimal throughput on variable-length requests. The system achieves state-of-the-art Transformer serving performance on GPU platforms and can be seamlessly integrated into existing PyTorch code with only a few lines of code.
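Since the abstract highlights that TurboTransformers drops into existing PyTorch code with a few lines, here is a minimal sketch of what such an integration might look like. The `turbo_transformers` module name and the `BertModel.from_torch` conversion helper follow the project's published examples, but the exact names, signatures, and return values should be treated as assumptions rather than a definitive API reference.

```python
import torch
import transformers
import turbo_transformers  # module name per the project's examples (assumed)

# Load a stock HuggingFace BERT and put it in inference mode.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()

# Hand the trained weights to the TurboTransformers runtime.
# `from_torch` is the conversion helper shown in the project's examples
# (assumed signature).
tt_model = turbo_transformers.BertModel.from_torch(torch_model)

# Serve a variable-length request with the same call convention as the
# original torch model (assumed to match).
input_ids = torch.randint(0, 30522, (1, 27))  # batch of 1, sequence length 27
with torch.no_grad():
    outputs = tt_model(input_ids)
```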

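The batch scheduler is the feature most amenable to a short illustration. The sketch below shows one way a dynamic program can pick split points over length-sorted requests so that total cost is minimized; the function name, the `max_batch` cap, the per-batch `overhead`, and the `batch_size * max_length` padding cost are all assumptions standing in for whatever measured cost model the real scheduler uses.

```python
def schedule_batches(lengths, max_batch=32, overhead=50.0):
    """Split length-sorted requests into contiguous batches via DP.

    A sketch of the idea named in the abstract: choose batch boundaries
    over the sorted request lengths so that total cost is minimized.
    The cost of a batch is modeled as a fixed per-batch `overhead` plus
    batch_size * max_length (every request is padded to the batch max).
    Both the proxy cost model and the parameter names are assumptions;
    the real scheduler would plug in measured latencies.
    """
    lengths = sorted(lengths)
    n = len(lengths)
    best = [0.0] + [float("inf")] * n  # best[i]: min cost of the first i requests
    split = [0] * (n + 1)              # split[i]: where the batch ending at i starts
    for i in range(1, n + 1):
        for j in range(max(0, i - max_batch), i):
            # lengths are sorted, so lengths[i-1] is this batch's padded length
            cost = best[j] + overhead + (i - j) * lengths[i - 1]
            if cost < best[i]:
                best[i], split[i] = cost, j
    # Walk the split points backwards to recover the batches.
    batches, i = [], n
    while i > 0:
        batches.append(lengths[split[i]:i])
        i = split[i]
    return list(reversed(batches)), best[n]
```

For instance, `schedule_batches([5, 120, 7, 130, 6, 125])` returns `([[5, 6, 7], [120, 125, 130]], 511.0)`: padding the short requests up to length 130 would cost more than paying the per-batch overhead twice, so the scheduler keeps short and long requests apart.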
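Finally, the memory allocator's footprint-versus-speed trade-off can be sketched as a best-fit cache of freed chunks. This is only in the spirit of the abstract's one-line description, not the paper's actual algorithm; the class name, the `reuse_factor` policy, and the host-side `bytearray` standing in for device memory are all assumptions.

```python
import bisect
from itertools import count

class CachingAllocator:
    """Best-fit cache of freed chunks (a sketch, not the paper's algorithm).

    Freed buffers are kept in a size-sorted list and reused on allocation,
    as long as reuse wastes at most `reuse_factor` times the requested
    size. This trades a bounded amount of extra footprint for cheap
    allocate/free, in the spirit of the trade-off the abstract describes
    for variable-length inputs.
    """

    def __init__(self, reuse_factor=2.0):
        self.free_chunks = []            # sorted (size, seq, buffer) triples
        self.reuse_factor = reuse_factor
        self._seq = count()              # tie-breaker so buffers never compare

    def alloc(self, size):
        # Best fit: the smallest cached chunk of at least `size` bytes.
        i = bisect.bisect_left(self.free_chunks, (size,))
        if i < len(self.free_chunks):
            chunk_size, _, buf = self.free_chunks[i]
            if chunk_size <= size * self.reuse_factor:  # cap wasted space
                del self.free_chunks[i]
                return buf
        # No suitable cached chunk: fall back to a fresh allocation
        # (bytearray stands in for a real device allocation call).
        return bytearray(size)

    def free(self, buf):
        # Return the chunk to the cache instead of releasing it.
        bisect.insort(self.free_chunks, (len(buf), next(self._seq), buf))
```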