Paper title
QuantPipe: Applying Adaptive Post-Training Quantization for Distributed Transformer Pipelines in Dynamic Edge Environments
Paper authors
Paper abstract
Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike in cloud scenarios with high-speed and stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve the accuracy with a directed-search analytical clipping for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving a practical model accuracy using a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85% on ImageNet compared to naive quantization.
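The idea of adapting the bitwidth to the available bandwidth can be illustrated with a minimal sketch. It is not the QuantPipe implementation: the abstract does not specify the selection policy, so the function names (`choose_bitwidth`, `quantize_uniform`, `dequantize_uniform`), the candidate bitwidths, and the latency-budget rule are all assumptions made for illustration. The quantizer shown is plain min-max uniform quantization, i.e. the kind of naive baseline the abstract compares against, not DS-ACIQ.

```python
def choose_bitwidth(num_elements, bandwidth_bps, latency_budget_s,
                    candidates=(2, 4, 6, 8, 16, 32)):
    """Hypothetical policy: pick the largest bitwidth whose transfer time
    for one tensor fits within the latency budget of a pipeline stage."""
    for bits in sorted(candidates, reverse=True):
        transfer_time_s = num_elements * bits / bandwidth_bps
        if transfer_time_s <= latency_budget_s:
            return bits
    # Nothing fits: fall back to the most aggressive compression.
    return min(candidates)


def quantize_uniform(values, bits):
    """Naive min-max uniform quantization (assumed baseline, not DS-ACIQ).
    Maps each value to an integer code in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize_uniform(codes, lo, scale):
    """Reconstruct approximate values from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    activations = [-1.2, -0.3, 0.0, 0.4, 2.5]
    # One million elements, 100 Mbit/s link, 100 ms budget per transfer.
    bits = choose_bitwidth(1_000_000, 100e6, 0.1)
    codes, lo, scale = quantize_uniform(activations, bits)
    restored = dequantize_uniform(codes, lo, scale)
    print(bits, codes, restored)
```

When the measured bandwidth drops, the same policy returns a smaller bitwidth, so the communicated tensor shrinks and the transfer time stays within the budget, at the cost of a larger reconstruction error. Analytical clipping methods such as the one named in the abstract aim to reduce that error at low bitwidths by choosing a tighter clipping range than the raw minimum and maximum.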